Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Filter out rows with NaN values for certain column

I have a dataset and in some of the rows an attribute value is NaN. This data is loaded into a dataframe and I would like to only use the rows which consist of rows where all attribute have values. I tried doing it via sql:

val df_data = sqlContext.sql("SELECT * FROM raw_data WHERE attribute1 != NaN")

I tried several variants on this, but I can't seem to get it working.

Another option would be to transform it to a RDD and then filter it, since filtering this dataframe to check if a attribute isNaN , does not work.

like image 883
Olivier_s_j Avatar asked May 27 '15 07:05

Olivier_s_j


People also ask

How do I filter NaN rows?

You can filter out rows with NAN value from pandas DataFrame column string, float, datetime e.t.c by using DataFrame. dropna() and DataFrame. notnull() methods. Python doesn't support Null hence any missing data is represented as None or NaN.


Video Answer


3 Answers

I know you accepted the other answer, but you can do it without the explode (which should perform better than doubling your DataFrame size).

Prior to Spark 1.6, you could use a udf like this:

def isNaNudf = udf[Boolean,Double](d => d.isNaN)
df.filter(isNaNudf($"value"))

As of Spark 1.6, you can now use the built-in SQL function isnan() like this:

df.filter(isnan($"value"))
like image 116
David Griffin Avatar answered Oct 29 '22 06:10

David Griffin


Here is some sample code that shows you my way of doing it -

import sqlContext.implicits._
val df = sc.parallelize(Seq((1, 0.5), (2, Double.NaN))).toDF("id", "value")
val df2 = df.explode[Double, Boolean]("value", "isNaN")(d => Seq(d.isNaN))

df will have -

df.show

id value
1  0.5  
2  NaN

while doing filter on df2 will give you what you want -

df2.filter($"isNaN" !== true).show

id value isNaN
1  0.5   false 
like image 20
Wesley Miao Avatar answered Oct 29 '22 06:10

Wesley Miao


This works:

where isNaN(tau_doc) = false

e.g.

val df_data = sqlContext.sql("SELECT * FROM raw_data where isNaN(attribute1) = false")
like image 2
hyokyun.park Avatar answered Oct 29 '22 06:10

hyokyun.park