
Check if ArrayType column contains null

I have a DataFrame with a column of ArrayType that can contain integer values. If there are no values, the array will contain a single element: the null value.

Important: note that the column itself will not be null, but an array with a single value, null.

> val df: DataFrame  = Seq(("foo", Seq(Some(2), Some(3))), ("bar", Seq(None))).toDF("k", "v")
df: org.apache.spark.sql.DataFrame = [k: string, v: array<int>]
> df.show()
+---+------+
|  k|     v|
+---+------+
|foo|[2, 3]|
|bar|[null]|
+---+------+

Question: I'd like to get the rows whose array contains the null value.


What I have tried thus far:

> df.filter(array_contains(df("v"), 2)).show()
+---+------+
|  k|     v|
+---+------+
|foo|[2, 3]|
+---+------+

For null, however, it does not seem to work:

> df.filter(array_contains(df("v"), null)).show()

org.apache.spark.sql.AnalysisException: cannot resolve 'array_contains(v, NULL)' due to data type mismatch: Null typed values cannot be used as arguments;

or

> df.filter(array_contains(df("v"), None)).show()

java.lang.RuntimeException: Unsupported literal type class scala.None$ None

asked Jun 01 '17 by Vassilis Moustakas

2 Answers

It is not possible to use array_contains in this case because SQL NULL cannot be compared for equality: NULL = NULL evaluates to NULL, not true.
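A quick way to see this (a minimal sketch, assuming a SparkSession named spark): equality against NULL yields NULL, while the null-safe operator <=> yields true:

spark.sql("SELECT NULL = NULL AS eq, NULL <=> NULL AS null_safe_eq").show

// +----+------------+
// |  eq|null_safe_eq|
// +----+------------+
// |null|        true|
// +----+------------+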

You can use a udf like this:

import org.apache.spark.sql.functions.udf

// true when the array contains a null element
val contains_null = udf((xs: Seq[Integer]) => xs.contains(null))

df.where(contains_null($"v")).show

// +---+------+
// |  k|     v|
// +---+------+
// |bar|[null]|
// +---+------+
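Note that this udf will throw a NullPointerException if the column value itself is null, since Spark passes a null Seq to the function for reference-typed arguments. If that can occur in your data, a guarded variant avoids it (a sketch, not part of the original answer; the name contains_null_safe is made up):

// also handles rows where the array column itself is null
val contains_null_safe = udf((xs: Seq[Integer]) => xs != null && xs.contains(null))

df.where(contains_null_safe($"v")).show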
answered by eliasah


For Spark 2.4+, you can use the higher-order function exists instead of a UDF:

df.where("exists(v, x -> x is null)").show

//+---+---+
//|  k|  v|
//+---+---+
//|bar| []|
//+---+---+
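From Spark 3.0 onwards the same higher-order function is also exposed in the Column API, so the filter can be written without an SQL string (a minimal sketch, assuming the df from the question):

import org.apache.spark.sql.functions.{col, exists}

// exists(column, predicate) returns true if any element satisfies the predicate
df.where(exists(col("v"), x => x.isNull)).show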
answered by blackbishop