Filter out rows with NaN values for certain column

Tags:

scala

apache-spark

apache-spark-sql

I have a dataset in which some rows have an attribute value of NaN. The data is loaded into a DataFrame, and I would like to use only the rows in which every attribute has a value. I tried doing it via SQL:

val df_data = sqlContext.sql("SELECT * FROM raw_data WHERE attribute1 != NaN")

I tried several variants of this, but I can't get it working.

Another option would be to transform it to an RDD and filter that, since filtering the DataFrame by checking whether an attribute isNaN does not work.

asked May 27 '15 07:05

Olivier_s_j

3 Answers

I know you accepted the other answer, but you can do it without the explode, which should perform better because it avoids doubling your DataFrame size.

Prior to Spark 1.6, you could use a udf like this:

def isNaNudf = udf[Boolean,Double](d => d.isNaN)
df.filter(isNaNudf($"value"))

As of Spark 1.6, you can now use the built-in SQL function isnan() like this:

df.filter(isnan($"value"))
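
Note that the filters above keep the rows where the value is NaN. To drop those rows instead, as the question asks, negate the condition. The following is a minimal sketch rather than a definitive implementation. It assumes Spark 2.0 or later with a local SparkSession (on 1.6 you would use sqlContext and its implicits), and the sample data and the names `clean` and `cleanAll` are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.isnan

// Assumption: a local session for experimenting; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("drop-nan").getOrCreate()
import spark.implicits._

// Hypothetical sample data: the row with id 2 has a NaN value.
val df = Seq((1, 0.5), (2, Double.NaN)).toDF("id", "value")

// Keep only the rows whose "value" is not NaN (note the negation).
val clean = df.filter(!isnan($"value"))

// Alternative: drop every row that has a null or NaN in any column.
val cleanAll = df.na.drop()

clean.show()
cleanAll.show()
```

The `na.drop()` variant may be the closer match if you want only rows where every attribute has a value, since it checks all columns at once.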

answered Oct 29 '22 06:10

David Griffin

Here is some sample code that shows my way of doing it:

import sqlContext.implicits._
val df = sc.parallelize(Seq((1, 0.5), (2, Double.NaN))).toDF("id", "value")
val df2 = df.explode[Double, Boolean]("value", "isNaN")(d => Seq(d.isNaN))

df contains:

df.show

id value
1  0.5  
2  NaN

while filtering df2 gives you what you want:

df2.filter($"isNaN" !== true).show

id value isNaN
1  0.5   false
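
On newer Spark versions, both `DataFrame.explode` and the `!==` operator are deprecated (as of Spark 2.0). A rough equivalent of the approach above might look like the sketch below, which builds the same flag column with `withColumn` and filters with `=!=`. It assumes Spark 2.0 or later and a local SparkSession; the names `flagged` and `kept` are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.isnan

// Assumption: a local session for experimenting.
val spark = SparkSession.builder().master("local[*]").appName("nan-flag").getOrCreate()
import spark.implicits._

val df = Seq((1, 0.5), (2, Double.NaN)).toDF("id", "value")

// Add a Boolean "isNaN" column, giving the same shape as the explode call above.
val flagged = df.withColumn("isNaN", isnan($"value"))

// =!= is the replacement for the deprecated !== operator.
val kept = flagged.filter($"isNaN" =!= true)
kept.show()
```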

answered Oct 29 '22 06:10

Wesley Miao

This works:

where isNaN(tau_doc) = false

e.g.

val df_data = sqlContext.sql("SELECT * FROM raw_data where isNaN(attribute1) = false")
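
For anyone who wants to try the SQL form end to end, here is a small sketch under some assumptions: Spark 2.0 or later with a local SparkSession, and a made-up two-row temp view standing in for the asker's raw_data table. `NOT isnan(...)` is used here as an equivalent way of writing `isnan(...) = false`:

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimenting.
val spark = SparkSession.builder().master("local[*]").appName("nan-sql").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the raw_data table from the question.
Seq((1, 0.5), (2, Double.NaN)).toDF("id", "attribute1").createOrReplaceTempView("raw_data")

// Keep only the rows whose attribute1 is not NaN.
val dfData = spark.sql("SELECT * FROM raw_data WHERE NOT isnan(attribute1)")
dfData.show()
```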

answered Oct 29 '22 06:10

hyokyun.park
