Replacing null values with 0 after spark dataframe left outer join

I have two dataframes called left and right.

scala> left.printSchema
root
|-- user_uid: double (nullable = true)
|-- labelVal: double (nullable = true)
|-- probability_score: double (nullable = true)

scala> right.printSchema
root
|-- user_uid: double (nullable = false)
|-- real_labelVal: double (nullable = false)

Then I join them with a left outer join to get the joined DataFrame. Anyone interested in the natjoin function can find it here (a rough sketch of the idea follows the link):

https://gist.github.com/anonymous/f02bd79528ac75f57ae8
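
In case that link ever goes stale: a natural join typically joins on every column name the two DataFrames share (here, user_uid). Below is a minimal sketch of the idea, assuming the Seq-based join overload available in later Spark versions; it is not necessarily the gist's exact code:

 import org.apache.spark.sql.DataFrame

 // Assumed sketch of a natural-join helper: join on every column name
 // the two DataFrames have in common, keeping one copy of each.
 def natjoin(left: DataFrame, right: DataFrame, joinType: String): DataFrame = {
   val commonCols = left.columns.intersect(right.columns).toSeq
   left.join(right, commonCols, joinType)
 }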

scala> val joinedData = natjoin(predictionDataFrame, labeledObservedDataFrame, "left_outer")

scala> joinedData.printSchema
root
|-- user_uid: double (nullable = true)
|-- labelVal: double (nullable = true)
|-- probability_score: double (nullable = true)
|-- real_labelVal: double (nullable = false)

Since it is a left outer join, the real_labelVal column has nulls when user_uid is not present in right.

scala> val realLabelVal = joinedData.select("real_labelval").distinct.collect
realLabelVal: Array[org.apache.spark.sql.Row] = Array([0.0], [null])

I want to replace the null values in the real_labelVal column with 1.0.

Currently I do the following:

  1. I find the index of the real_labelval column and use the org.apache.spark.sql.Row API to set the nulls to 1.0. (This gives me an RDD[Row].)
  2. Then I apply the schema of the joined DataFrame to get the cleaned DataFrame.

The code is as follows:

 import org.apache.spark.sql.Row

 // real_labelVal is the fourth column in the joined schema
 val real_labelval_index = 3

 def replaceNull(row: Row) = {
   val rowArray = row.toSeq.toArray
   rowArray(real_labelval_index) = 1.0
   Row.fromSeq(rowArray)
 }

 // Replace nulls row by row, then re-apply the original schema
 val cleanRowRDD = joinedData.map(row => if (row.isNullAt(real_labelval_index)) replaceNull(row) else row)
 val cleanJoined = sqlContext.createDataFrame(cleanRowRDD, joinedData.schema)

Is there an elegant or efficient way to do this?

Googling hasn't helped much. Thanks in advance.

Mihir Shinde
asked Aug 04 '15 at 01:08


1 Answer

Have you tried using the na functions?

joinedData.na.fill(1.0, Seq("real_labelval"))
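
As a usage sketch (the expected output below is inferred from the distinct values shown in the question, not verified): na.fill returns a new DataFrame and only touches the listed columns, so nothing else in joinedData is affected.

 // Fill nulls in real_labelval with 1.0; all other columns are untouched.
 val cleanJoined = joinedData.na.fill(1.0, Seq("real_labelval"))

 // The distinct values should now be 0.0 and 1.0, with no null:
 cleanJoined.select("real_labelval").distinct.collect
 // e.g. Array([0.0], [1.0])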
Justin Pihony
answered Oct 22 '22 at 21:10