Given a dataframe :
val df = sc.parallelize(Seq(("foo", ArrayBuffer(null,"bar",null)), ("bar", ArrayBuffer("one","two",null)))).toDF("key", "value")
df.show
+---+--------------------------+
|key| value|
+---+--------------------------+
|foo|ArrayBuffer(null,bar,null)|
|bar|ArrayBuffer(one, two,null)|
+---+--------------------------+
I'd like to drop null from column value. After removal the dataframe should look like this :
+---+--------------------------+
|key| value|
+---+--------------------------+
|foo|ArrayBuffer(bar) |
|bar|ArrayBuffer(one, two) |
+---+--------------------------+
Any suggestion welcome . 10x
You'll need an UDF here. For example with a flatMap:
val filterOutNull = udf((xs: Seq[String]) =>
Option(xs).map(_.flatMap(Option(_))))
df.withColumn("value", filterOutNull($"value"))
where external Option with map handles NULL columns:
Option(null: Seq[String]).map(identity)
Option[Seq[String]] = None
Option(Seq("foo", null, "bar")).map(identity)
Option[Seq[String]] = Some(List(foo, null, bar))
and ensures we don't fail with NPE when input is NULL / null by mapping
NULL -> null -> None -> None -> NULL
where null is a Scala null and NULL is a SQL NULL.
The internal flatMap flattens a sequence of Options effectively filtering nulls:
Seq("foo", null, "bar").flatMap(Option(_))
Seq[String] = List(foo, bar)
A more imperative equivalent could be something like this:
val imperativeFilterOutNull = udf((xs: Seq[String]) =>
if (xs == null) xs
else for {
x <- xs
if x != null
} yield x)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With