Why is nullable = true used after some functions are executed even though there are no NaN values in the DataFrame.
val myDf = Seq((2,"A"),(2,"B"),(1,"C"))
.toDF("foo","bar")
.withColumn("foo", 'foo.cast("Int"))
myDf.withColumn("foo_2", when($"foo" === 2 , 1).otherwise(0)).select("foo", "foo_2").show
When df.printSchema is called now nullable will be false for both columns.
val foo: (Int => String) = (t: Int) => {
fooMap.get(t) match {
case Some(tt) => tt
case None => "notFound"
}
}
val fooMap = Map(
1 -> "small",
2 -> "big"
)
val fooUDF = udf(foo)
myDf
.withColumn("foo", fooUDF(col("foo")))
.withColumn("foo_2", when($"foo" === 2 , 1).otherwise(0)).select("foo", "foo_2")
.select("foo", "foo_2")
.printSchema
However now, nullable is true for at least one column which was false before. How can this be explained?
Nullable indicates if the concerned column can be null or not. It ensures that a specific column can't be null (if it's null while the nullable property is set to true, Spark will launch a java.
In Spark, using filter() or where() functions of DataFrame we can filter rows with NULL values by checking IS NULL or isNULL . These removes all rows with null values on state column and returns the new DataFrame. All above examples returns the same output.
Method 1: Using DataFrame.withColumn() The DataFrame. withColumn(colName, col) returns a new DataFrame by adding a column or replacing the existing column that has the same name. We will make use of cast(x, dataType) method to casts the column to a different data type.
When creating Dataset from statically typed structure (without depending on schema argument) Spark uses a relatively simple set of rules to determine nullable property.
null then its DataFrame representation is nullable.Option[_] then then its DataFrame representation is nullable with None considered to be SQL NULL.nullable.Since Scala String is java.lang.String, which can be null, generated column can is nullable. For the same reason bar column is nullable in the initial dataset:
val data1 = Seq[(Int, String)]((2, "A"), (2, "B"), (1, "C"))
val df1 = data1.toDF("foo", "bar")
df1.schema("bar").nullable
Boolean = true
but foo is not (scala.Int cannot be null).
df1.schema("foo").nullable
Boolean = false
If we change data definition to:
val data2 = Seq[(Integer, String)]((2, "A"), (2, "B"), (1, "C"))
foo will be nullable (Integer is java.lang.Integer and boxed integer can be null):
data2.toDF("foo", "bar").schema("foo").nullable
Boolean = true
See also: SPARK-20668 Modify ScalaUDF to handle nullability.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With