Create new DataFrame with empty/null field values

I am creating a new DataFrame from an existing DataFrame, but need to add a new column ("field1" in the code below) to this new DataFrame. How do I do so? A working sample code example would be appreciated.

val edwDf = omniDataFrame 
  .withColumn("field1", callUDF((value: String) => None)) 
  .withColumn("field2",
    callUdf("devicetypeUDF", (omniDataFrame.col("some_field_in_old_df")))) 

edwDf
  .select("field1", "field2")
  .save("odsoutdatafldr", "com.databricks.spark.csv"); 
asked Aug 18 '15 by sshroff


People also ask

How do I create a column with null values in PySpark?

In PySpark, to add a new column to a DataFrame, use the lit() function from pyspark.sql.functions. lit() takes a constant value and returns a Column; to add a NULL / None value, use lit(None).

How do I replace null with a blank value in a DataFrame?

Replacing null values is one of the most common operations on PySpark DataFrames. It can be done with either the DataFrame.fillna() or the DataFrameNaFunctions.fill() method.
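
For completeness, a minimal Scala sketch of the same idea (assumes Spark 2.x; the column names and values are illustrative only), using DataFrameNaFunctions via df.na.fill to replace nulls in a string column with an empty string:

import org.apache.spark.sql.SparkSession

// Minimal sketch: a local SparkSession and a toy DataFrame containing some nulls
val spark = SparkSession.builder().master("local[*]").appName("fillna-sketch").getOrCreate()
import spark.implicits._

val df = Seq((1, Some("foo")), (2, None: Option[String])).toDF("id", "foobar")

// Replace nulls in the "foobar" column with an empty string
val filled = df.na.fill("", Seq("foobar"))
filled.show()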


2 Answers

It is possible to use lit(null):

import org.apache.spark.sql.functions.{lit, udf}

case class Record(foo: Int, bar: String)
val df = Seq(Record(1, "foo"), Record(2, "bar")).toDF

val dfWithFoobar = df.withColumn("foobar", lit(null: String))

One problem here is that the column type is null:

scala> dfWithFoobar.printSchema
root
 |-- foo: integer (nullable = false)
 |-- bar: string (nullable = true)
 |-- foobar: null (nullable = true)

and it is not retained by the csv writer. If that is a hard requirement, you can cast the column to a specific type (let's say String), using either a DataType

import org.apache.spark.sql.types.StringType

df.withColumn("foobar", lit(null).cast(StringType))

or a string description

df.withColumn("foobar", lit(null).cast("string"))

or use a UDF like this:

val getNull = udf(() => None: Option[String]) // Or some other type

df.withColumn("foobar", getNull()).printSchema
root
 |-- foo: integer (nullable = false)
 |-- bar: string (nullable = true)
 |-- foobar: string (nullable = true)

A Python equivalent can be found here: Add an empty column to spark DataFrame

answered by zero323


Just to extend the perfect answer provided by @zero323, here is a solution that can be used starting with Spark 2.2.0.

import org.apache.spark.sql.functions.typedLit

df.withColumn("foobar", typedLit[Option[String]](None)).printSchema
root
 |-- foo: integer (nullable = false)
 |-- bar: string (nullable = true)
 |-- foobar: string (nullable = true)

It's similar to the 3rd solution, but without using any UDF.
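
As a hedged aside (the extra column names below are placeholders, not from the original answer), typedLit also handles other parameterized Scala types such as Option[Int] or Seq[String] directly, which plain lit cannot:

import org.apache.spark.sql.functions.typedLit

// typedLit keeps the full Scala type, so no UDF is needed for typed nulls or collections
df.withColumn("maybeInt", typedLit[Option[Int]](None))   // integer column, all nulls
  .withColumn("tags", typedLit(Seq("a", "b")))           // array<string> column
  .printSchema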

answered by sanyi14ka