Scala Dataframe null check for columns

Question

val new_df = df.filter($"type_interne" !== "" || $"type_interne" !== "null")

Give me error value || is not member of string

When i use === works well for filter

val new_df = df.filter($"type_interne" === "" || $"type_interne" === "null")

Nikita · Accepted Answer

Though Raphael's answer was fully correct at the time of writing, spark evolving... Operator !== is deprecated since version 2.0, but you can use =!= which solves precedence problem above without using parenthesis. See corresponding comments in source code: https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L319-L320

Detailed answer:
I'd like also to note something which was not obvious for me in the beginning. There are notions of DataFrame (DF) and DataSet (DS), which also divide their usage in above context into:
1) strings which interpreted by catalyst (error are caught up only in run time) - both DF and DS case class NullStrings(n: Int, s: String)

val df = spark.sparkContext.parallelize(Seq(
    (1, "abc"),
    (2, "ABC"),
    (3, null),
    (4, ""))
).toDF("n", "s")

df.filter("s is not null and s != ''").show()

+---+---+
|  n|  s|
+---+---+
|  1|abc|
|  2|ABC|
+---+---+

2) dataframe syntax using Column notion ($ with spark.implicits._ import) partially compile checked:

df.filter($"s" =!= "" || $"s" =!= null).show()

but in fact =!= ignores nulls (see <=> for null-safe comparison), hence below is equal to

df.filter($"s" =!= "").show()

+---+---+
|  n|  s|
+---+---+
|  1|abc|
|  2|ABC|
+---+---+

3) dataset

val ds = df.as[NullStrings]

ds.filter(r => r.s != null && r.s.nonEmpty).show()
+---+---+
|  n|  s|
+---+---+
|  1|abc|
|  2|ABC|
+---+---+

Beware if you use Option in case class, you have to deal with it, not simple string.

case class NullStringsOption(n: Int, s: Option[String])

val ds1 = df.as[NullStringsOption]

ds1.filter(_.s.exists(_.nonEmpty)).show()

Scala Dataframe null check for columns

Tags:

dataframe

scala

apache-spark

apache-spark-sql

Subhod Lagade

1 Answers

Nikita

Recent Activity

Donate For Us

Scala Dataframe null check for columns

Tags:

dataframe

scala

apache-spark

apache-spark-sql

Subhod Lagade

1 Answers

Nikita

Related questions

Recent Activity

Donate For Us