Why do columns change to nullable in Apache Spark SQL?

Why is nullable = true used after some functions are executed, even though there are no NaN values in the DataFrame?

// assumes a SparkSession named spark (e.g. in spark-shell)
import spark.implicits._

val myDf = Seq((2, "A"), (2, "B"), (1, "C"))
  .toDF("foo", "bar")
  .withColumn("foo", 'foo.cast("Int"))

myDf.withColumn("foo_2", when($"foo" === 2, 1).otherwise(0)).select("foo", "foo_2").show

When printSchema is called on this result, nullable is false for both columns.
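For reference, this is the schema one would expect at this point (output reconstructed, not copied from a run):

myDf.withColumn("foo_2", when($"foo" === 2, 1).otherwise(0))
  .select("foo", "foo_2")
  .printSchema
// root
//  |-- foo: integer (nullable = false)
//  |-- foo_2: integer (nullable = false)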

import org.apache.spark.sql.functions.{col, udf}

val fooMap = Map(
  1 -> "small",
  2 -> "big"
)

// look up the label for a key, falling back to "notFound"
val foo: Int => String = (t: Int) => {
  fooMap.get(t) match {
    case Some(tt) => tt
    case None => "notFound"
  }
}

val fooUDF = udf(foo)

myDf
  .withColumn("foo", fooUDF(col("foo")))
  // note: foo is now a string column, so $"foo" === 2 never matches
  .withColumn("foo_2", when($"foo" === 2, 1).otherwise(0))
  .select("foo", "foo_2")
  .printSchema
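A sketch of what this prints (reconstructed; note that foo is now a string column, since fooUDF returns String):

// root
//  |-- foo: string (nullable = true)
//  |-- foo_2: integer (nullable = false)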

However, nullable is now true for at least one column that was false before. How can this be explained?

asked Nov 15 '16 by Georg Heiler


1 Answer

When creating a Dataset from a statically typed structure (without depending on the schema argument), Spark uses a relatively simple set of rules to determine the nullable property.

  • If an object of the given type can be null, then its DataFrame representation is nullable.
  • If an object is an Option[_], then its DataFrame representation is nullable, with None considered to be SQL NULL.
  • In any other case it will be marked as not nullable (a sketch illustrating all three cases follows below).
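
A minimal illustration of these rules (field names and values are arbitrary; assumes spark.implicits._ is in scope):

val df = Seq((1, "x", Option(1.0)), (2, "y", None: Option[Double]))
  .toDF("i", "s", "o")

df.schema.foreach(f => println(s"${f.name}: nullable = ${f.nullable}"))
// i: nullable = false  -- scala.Int cannot be null
// s: nullable = true   -- String is java.lang.String, which can be null
// o: nullable = true   -- Option[Double]; None becomes SQL NULL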

Since Scala String is java.lang.String, which can be null, the generated column is nullable. For the same reason, the bar column is nullable in the initial dataset:

val data1 = Seq[(Int, String)]((2, "A"), (2, "B"), (1, "C"))
val df1 = data1.toDF("foo", "bar")
df1.schema("bar").nullable
Boolean = true

but foo is not (scala.Int cannot be null).

df1.schema("foo").nullable
Boolean = false

If we change data definition to:

val data2 = Seq[(Integer, String)]((2, "A"), (2, "B"), (1, "C"))

foo will be nullable (Integer is java.lang.Integer, and a boxed integer can be null):

data2.toDF("foo", "bar").schema("foo").nullable
Boolean = true

See also: SPARK-20668 Modify ScalaUDF to handle nullability.
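
Tying this back to the question: fooUDF returns a Scala String, so by the first rule above its output column is marked nullable, even though the function itself never returns null:

df1.withColumn("foo", fooUDF(col("foo"))).schema("foo").nullable
// Boolean = true -- the String result type can be null, so Spark is conservative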

answered Sep 30 '22 by zero323