I am creating a new DataFrame from an existing DataFrame, but I need to add a new column ("field1" in the code below) to this new DataFrame. How do I do so? A working code sample would be appreciated.
val edwDf = omniDataFrame
.withColumn("field1", callUDF((value: String) => None))
.withColumn("field2",
callUdf("devicetypeUDF", (omniDataFrame.col("some_field_in_old_df"))))
edwDf
.select("field1", "field2")
.save("odsoutdatafldr", "com.databricks.spark.csv");
In PySpark, to add a new column with a constant value to a DataFrame, use the lit() function, imported with from pyspark.sql.functions import lit. lit() takes the constant value you want to add and returns a Column; to add a NULL / None column, use lit(None).
Replacing null values is one of the most common operations on PySpark DataFrames. It can be done with either DataFrame.fillna() or DataFrameNaFunctions.fill().
It is possible to use lit(null):
import org.apache.spark.sql.functions.{lit, udf}
case class Record(foo: Int, bar: String)
val df = Seq(Record(1, "foo"), Record(2, "bar")).toDF
val dfWithFoobar = df.withColumn("foobar", lit(null: String))
One problem here is that the column type is null:
scala> dfWithFoobar.printSchema
root
|-- foo: integer (nullable = false)
|-- bar: string (nullable = true)
|-- foobar: null (nullable = true)
and it is not retained by the csv writer. If it is a hard requirement, you can cast the column to a specific type (let's say String), with either DataType
import org.apache.spark.sql.types.StringType
df.withColumn("foobar", lit(null).cast(StringType))
or string description
df.withColumn("foobar", lit(null).cast("string"))
or use a UDF like this:
val getNull = udf(() => None: Option[String]) // Or some other type
df.withColumn("foobar", getNull()).printSchema
root
|-- foo: integer (nullable = false)
|-- bar: string (nullable = true)
|-- foobar: string (nullable = true)
A Python equivalent can be found here: Add an empty column to spark DataFrame
Just to extend the perfect answer provided by @zero323, here's a solution which can be used starting from Spark 2.2.0.
import org.apache.spark.sql.functions.typedLit
df.withColumn("foobar", typedLit[Option[String]](None)).printSchema
root
|-- foo: integer (nullable = false)
|-- bar: string (nullable = true)
|-- foobar: string (nullable = true)
It's similar to the 3rd solution, but without using any UDF.
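Tying this back to the original question, here is a minimal sketch of adding a typed null "field1" column and writing the result as CSV. It assumes Spark 2.x or later (SparkSession and the built-in csv writer rather than com.databricks.spark.csv). The sample rows, the withNullStringColumn helper, and the use of upper() as a stand-in for devicetypeUDF are all hypothetical, added only so the example runs on its own.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, upper}
import org.apache.spark.sql.types.StringType

object AddNullColumn {

  // Hypothetical helper: adds a column of nulls with an explicit string type,
  // so the schema reports "string" instead of "null".
  def withNullStringColumn(df: DataFrame, name: String): DataFrame =
    df.withColumn(name, lit(null).cast(StringType))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("add-null-column")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for the question's omniDataFrame (made-up rows).
    val omniDataFrame = Seq("phone", "tablet").toDF("some_field_in_old_df")

    val edwDf = withNullStringColumn(omniDataFrame, "field1")
      // upper() is only a placeholder for the question's devicetypeUDF.
      .withColumn("field2", upper(col("some_field_in_old_df")))

    edwDf.printSchema()

    edwDf
      .select("field1", "field2")
      .write
      .mode("overwrite")          // replace the output folder if it exists
      .option("header", "true")
      .csv("odsoutdatafldr")

    spark.stop()
  }
}
```

The cast is what keeps the column usable downstream: without it the column has the null type described above, which the csv writer does not retain.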