I'm trying to filter one dataframe against another:
scala> val df1 = sc.parallelize((1 to 100).map(a=>(s"user $a", a*0.123, a))).toDF("name", "score", "user_id")
scala> val df2 = sc.parallelize(List(2,3,4,5,6)).toDF("valid_id")
Now I want to filter df1 and get back a DataFrame containing all the rows in df1 where user_id is in df2("valid_id"). In other words, I want all the rows in df1 where user_id is one of 2, 3, 4, 5, or 6.
scala> df1.select("user_id").filter($"user_id" in df2("valid_id"))
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
org.apache.spark.sql.AnalysisException: resolved attribute(s) valid_id#20 missing from user_id#18 in operator !Filter user_id#18 IN (valid_id#20);
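As an aside, the deprecation warning seems to be about Column.in, which 1.5 deprecates in favor of isin; but isin takes literal values rather than a column from another DataFrame. The only workaround I can see along those lines is to collect the valid ids to the driver first (a rough sketch, assuming df2 is small enough to collect):
scala> val validIds = df2.collect().map(_.getInt(0))     // Array(2, 3, 4, 5, 6), pulled back to the driver
scala> df1.filter($"user_id".isin(validIds: _*)).show()  // keeps only the rows whose user_id is in that local list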
On the other hand when I try to do a filter against a function, everything looks great:
scala> df1.select("user_id").filter(($"user_id" % 2) === 0)
res1: org.apache.spark.sql.DataFrame = [user_id: int]
Why am I getting this error? Is there something wrong with my syntax?
Following a comment, I have tried a left outer join:
scala> df1.show
+-------+------------------+-------+
| name| score|user_id|
+-------+------------------+-------+
| user 1| 0.123| 1|
| user 2| 0.246| 2|
| user 3| 0.369| 3|
| user 4| 0.492| 4|
| user 5| 0.615| 5|
| user 6| 0.738| 6|
| user 7| 0.861| 7|
| user 8| 0.984| 8|
| user 9| 1.107| 9|
|user 10| 1.23| 10|
|user 11| 1.353| 11|
|user 12| 1.476| 12|
|user 13| 1.599| 13|
|user 14| 1.722| 14|
|user 15| 1.845| 15|
|user 16| 1.968| 16|
|user 17| 2.091| 17|
|user 18| 2.214| 18|
|user 19|2.3369999999999997| 19|
|user 20| 2.46| 20|
+-------+------------------+-------+
only showing top 20 rows
scala> df2.show
+--------+
|valid_id|
+--------+
| 2|
| 3|
| 4|
| 5|
| 6|
+--------+
scala> df1.join(df2, df1("user_id") === df2("valid_id"))
res6: org.apache.spark.sql.DataFrame = [name: string, score: double, user_id: int, valid_id: int]
scala> res6.collect
res7: Array[org.apache.spark.sql.Row] = Array()
scala> df1.join(df2, df1("user_id") === df2("valid_id"), "left_outer")
res8: org.apache.spark.sql.DataFrame = [name: string, score: double, user_id: int, valid_id: int]
scala> res8.count
res9: Long = 0
I'm running Spark 1.5.0 with Scala 2.10.5.
You want a (regular) inner join, not an outer join :)
df1.join(df2, df1("user_id") === df2("valid_id"))