Spark DataFrame: select rows with at least one null or blank value in any column of that row

From one DataFrame I want to create a new DataFrame containing only the rows where at least one value in any of the columns is null or blank, in Spark 1.5 / Scala.

I am trying to write a generalized function that creates this new DataFrame: I pass in the DataFrame and a list of columns, and it returns the matching records.

Thanks

asked Dec 13 '22 by user1122

1 Answer

Sample Data:

// Outside spark-shell you also need the implicits for toDF:
// import sqlContext.implicits._ (Spark 1.x) or import spark.implicits._ (Spark 2+)
val df = Seq((null, Some(2)), (Some("a"), Some(4)), (Some(""), Some(5)), (Some("b"), null)).toDF("A", "B")

df.show
+----+----+
|   A|   B|
+----+----+
|null|   2|
|   a|   4|
|    |   5|
|   b|null|
+----+----+

You can construct the condition as follows, assuming "blank" means an empty string here:

import org.apache.spark.sql.functions.col

// Build an "is null or empty string" predicate for each column, then OR them all together
val cond = df.columns.map(x => col(x).isNull || col(x) === "").reduce(_ || _)

df.filter(cond).show
+----+----+
|   A|   B|
+----+----+
|null|   2|
|    |   5|
|   b|null|
+----+----+
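To cover the "generalized function" part of the question, here is a minimal sketch (the function name and signature are my own, not from the answer) that takes a DataFrame and a list of column names and returns the rows where at least one of those columns is null or an empty string:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper: keeps rows where at least one of the given columns
// is null or an empty string. Assumes `cols` is non-empty and that every
// name exists in `df` (otherwise Spark raises an AnalysisException).
def rowsWithNullOrBlank(df: DataFrame, cols: Seq[String]): DataFrame = {
  val cond = cols
    .map(c => col(c).isNull || col(c) === "")  // per-column predicate
    .reduce(_ || _)                            // OR the predicates together
  df.filter(cond)
}

// Usage with the sample data above:
// rowsWithNullOrBlank(df, Seq("A", "B")).show
// rowsWithNullOrBlank(df, df.columns) reproduces the filter shown above.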
answered Dec 21 '22 by Psidom