val new_df = df.filter($"type_interne" !== "" || $"type_interne" !== "null")
This gives me the error:

value || is not a member of String

When I use ===, the filter works well:
val new_df = df.filter($"type_interne" === "" || $"type_interne" === "null")
Though Raphael's answer was fully correct at the time of writing, Spark has kept evolving: the operator !== has been deprecated since version 2.0, but you can use =!=, which solves the precedence problem above without parentheses. See the corresponding comments in the source code:
https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L319-L320
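To illustrate the precedence difference (a minimal sketch reusing the question's column name): Scala parses operators that end in = but do not start with = (and are not comparison operators like !=) as assignment operators with the lowest precedence, so in the question's code the right operand "" || ... is grouped first, || is looked up on String, and you get the error above. =!= has the ordinary precedence of ===:

// with the deprecated !==, parentheses would be required:
// df.filter(($"type_interne" !== "") || ($"type_interne" !== "null"))
// with =!= (Spark 2.0+) the expression parses as intended:
val new_df = df.filter($"type_interne" =!= "" || $"type_interne" =!= "null")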
Detailed answer:

I'd also like to note something that was not obvious to me in the beginning: there are the notions of DataFrame (DF) and Dataset (DS), which also divide their usage in the above context into:

1) SQL expression strings, which are interpreted by Catalyst (errors are caught only at run time) - works for both DF and DS
import spark.implicits._

case class NullStrings(n: Int, s: String)

val df = spark.sparkContext.parallelize(Seq(
  (1, "abc"),
  (2, "ABC"),
  (3, null),
  (4, ""))
).toDF("n", "s")
df.filter("s is not null and s != ''").show()
+---+---+
| n| s|
+---+---+
| 1|abc|
| 2|ABC|
+---+---+
2) DataFrame syntax using the Column notion ($ with the spark.implicits._ import), partially compile-time checked:
df.filter($"s" =!= "" || $"s" =!= null).show()
but in fact =!= ignores nulls (see <=> for null-safe comparison), hence the above is equivalent to
df.filter($"s" =!= "").show()
+---+---+
| n| s|
+---+---+
| 1|abc|
| 2|ABC|
+---+---+
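As a sketch (reusing the df above), the null handling can be made explicit instead of relying on =!= returning null for the null row; isNotNull, or a negated null-safe <=> comparison, states the intent directly and produces the same two rows:

// explicit null check plus non-empty check:
df.filter($"s".isNotNull && $"s" =!= "").show()
// equivalent, via the null-safe equality operator <=>:
df.filter(!($"s" <=> null) && $"s" =!= "").show()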
3) Dataset syntax, where the filter predicate is plain Scala and checked at compile time:
val ds = df.as[NullStrings]
ds.filter(r => r.s != null && r.s.nonEmpty).show()
+---+---+
| n| s|
+---+---+
| 1|abc|
| 2|ABC|
+---+---+
Beware: if you use Option in the case class, you have to deal with the Option, not a plain String.
case class NullStringsOption(n: Int, s: Option[String])
val ds1 = df.as[NullStringsOption]
ds1.filter(_.s.exists(_.nonEmpty)).show()
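For illustration, the same Option-aware filter spelled out with pattern matching (a sketch, equivalent to the exists version above):

ds1.filter(_.s match {
  case Some(v) => v.nonEmpty  // keep rows with a non-empty string
  case None    => false       // drop rows where s was null
}).show()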