
How to drop rows with too many NULL values?

I want to do some preprocessing on my data and I want to drop the rows that are sparse (for some threshold value).

For example, I have a dataframe with 10 features, and if a row has 8 null values, I want to drop it.

I found some related topics but I cannot find any useful information for my purpose.

stackoverflow.com/questions/3473778/count-number-of-nulls-in-a-row

Examples like the one in the link above won't work for me, because I want to do this preprocessing automatically. I cannot write the column names and do something accordingly.

So is there any way to do this delete operation without using the column names in Apache Spark with Scala?

Merve Bozo asked Mar 17 '16 14:03


1 Answer

I'm surprised that no answers pointed out that Spark SQL comes with a few standard functions that meet the requirement:

For example, I have a dataframe with 10 features, and if a row has 8 null values, I want to drop it.

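You could use one of the variants of the DataFrameNaFunctions.drop method with minNonNulls set appropriately, say 3: a row with 8 nulls out of 10 columns has only 2 non-null values, so it falls below the minimum and is dropped.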

drop(minNonNulls: Int, cols: Seq[String]): DataFrame
Returns a new DataFrame that drops rows containing less than minNonNulls non-null and non-NaN values in the specified columns.
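Since the question is phrased in terms of "too many nulls" while drop takes a minimum number of non-nulls, it may help to spell out the conversion. The sketch below is plain Scala with no Spark dependency; minNonNullsFor and maxNulls are hypothetical names introduced here for illustration, not part of the Spark API.

```scala
object NullThreshold {
  // Hypothetical helper (not part of Spark): if rows with more than
  // `maxNulls` missing values should be dropped, a surviving row needs
  // at least numCols - maxNulls non-null values.
  def minNonNullsFor(numCols: Int, maxNulls: Int): Int = {
    require(numCols >= 0 && maxNulls >= 0, "counts must be non-negative")
    math.max(0, numCols - maxNulls)
  }

  def main(args: Array[String]): Unit = {
    // 10 features, tolerate at most 7 nulls: a row with 8 nulls has only
    // 2 non-null values, which is below the computed minimum of 3.
    println(minNonNullsFor(10, 7)) // prints 3
    // 5 columns, tolerate at most 3 nulls.
    println(minNonNullsFor(5, 3))  // prints 2
    // With a DataFrame `df` (assumed to exist), the call would be:
    //   df.na.drop(minNonNullsFor(df.columns.length, 7), df.columns)
  }
}
```

Keeping the threshold as an integer count avoids floating-point rounding surprises that a fractional threshold could introduce.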

And to meet the variability in the column names as in the requirement:

I cannot write the column names and do something accordingly.

You can simply use Dataset.columns:

columns: Array[String]
Returns all column names as an array.


Let's say you've got the following dataset with 5 features (columns) and a few rows that are almost all nulls.

val ns: String = null
val features = Seq(("0","1","2",ns,ns), (ns, ns, ns, ns, ns), (ns, "1", ns, "2", ns)).toDF
scala> features.show
+----+----+----+----+----+
|  _1|  _2|  _3|  _4|  _5|
+----+----+----+----+----+
|   0|   1|   2|null|null|
|null|null|null|null|null|
|null|   1|null|   2|null|
+----+----+----+----+----+

// drop rows with more than (5 columns - 2) = 3 nulls
scala> features.na.drop(2, features.columns).show
+----+---+----+----+----+
|  _1| _2|  _3|  _4|  _5|
+----+---+----+----+----+
|   0|  1|   2|null|null|
|null|  1|null|   2|null|
+----+---+----+----+----+
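The session above runs in spark-shell, where the SparkSession and its implicits are already in scope. Outside the shell, the same idea could be wrapped roughly as follows. This is only a sketch: the object name, the dropSparse helper, its maxNulls parameter, and the local[*] master are assumptions made for illustration, and it needs spark-sql on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DropSparseRows {
  // Hypothetical helper: drop rows with more than `maxNulls` missing values,
  // considering every column without naming any of them explicitly.
  def dropSparse(df: DataFrame, maxNulls: Int): DataFrame = {
    val cols = df.columns
    // A surviving row needs at least (number of columns - maxNulls) non-nulls.
    df.na.drop(math.max(0, cols.length - maxNulls), cols)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("drop-sparse-rows")
      .master("local[*]") // assumption: local run for demonstration
      .getOrCreate()
    import spark.implicits._ // needed for toDF outside spark-shell

    val ns: String = null
    val features = Seq(
      ("0", "1", "2", ns, ns),
      (ns, ns, ns, ns, ns),
      (ns, "1", ns, "2", ns)
    ).toDF()

    // Same threshold as in the shell session: at most 3 nulls out of 5 columns.
    dropSparse(features, maxNulls = 3).show()

    spark.stop()
  }
}
```

With maxNulls = 3 this should keep the same two rows as the shell output above, since only the all-null row has more than 3 nulls.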
Jacek Laskowski answered Sep 30 '22 19:09