... by checking whether a column's value is in a seq.
Perhaps I'm not explaining it very well; I basically want this (to express it using regular SQL): DF_Column IN seq?
First I did it using a broadcast var (where I placed the seq), a UDF (that did the checking) and registerTempTable. The problem is that I didn't get to test it, since I ran into a known bug that apparently only appears when using registerTempTable with ScalaIDE.
I ended up creating a new DataFrame out of seq and doing an inner join with it (intersection), but I doubt that's the most performant way of accomplishing the task.
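Roughly what I mean by that join approach (a sketch; df and the column names here are illustrative, not my actual code, assuming Spark 1.4-era APIs):

// Turn the seq into a one-column DataFrame and inner-join on it;
// only rows whose login appears in the seq survive the join
val seq = Seq("login2", "login3", "login4")
val seqDF = sqlContext.createDataFrame(seq.map(Tuple1(_))).toDF("username")
val filtered = df.join(seqDF, df("login") === seqDF("username")).drop("username")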
Thanks
EDIT (in response to @YijieShen): How do I filter based on whether elements of one DataFrame's column are in another DF's column (like the SQL select * from A where login in (select username from B))?
E.g., first DF:
login   count
login1  192
login2  146
login3  72
Second DF:
username
login2
login3
login4
The result:
login   count
login2  146
login3  72
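In DataFrame terms this is effectively a left semi join; a sketch of one way to express it (assuming Spark 1.4+, where the "leftsemi" join type is accepted):

// Keep rows of `ordered` whose login matches some username in `empLogins`,
// without pulling empLogins' columns into the result
val result = ordered.join(empLogins, ordered("login") === empLogins("username"), "leftsemi")
result.show()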
Attempts:
EDIT-2: I think, now that the bug is fixed, these should work. END EDIT-2
ordered.select("login").filter($"login".contains(empLogins("username")))
and
ordered.select("login").filter($"login" in empLogins("username"))
which both throw Exception in thread "main" org.apache.spark.sql.AnalysisException, respectively:
resolved attribute(s) username#10 missing from login#8 in operator !Filter Contains(login#8, username#10);
and
resolved attribute(s) username#10 missing from login#8 in operator !Filter login#8 IN (username#10);
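(For the original seq-membership case, where the values live in a local seq rather than in another DataFrame's column, Column.isin is another option; a sketch, assuming Spark 1.5+, where isin replaced the deprecated in:)

// `allowed` is a local collection, not another DataFrame's column
val allowed = Seq("login2", "login3")
val kept = ordered.filter($"login".isin(allowed: _*))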
My code (following the description of your first method) runs normally in Spark 1.4.0-SNAPSHOT on these two configurations:
Intellij IDEA's test
Spark Standalone cluster with 8 nodes (1 master, 7 workers)
Please check if there are any differences.
import org.apache.spark.sql.functions.{col, udf}

// Broadcast the lookup seq to all executors
val bc = sc.broadcast(Array[String]("login3", "login4"))
val x = Array(("login1", 192), ("login2", 146), ("login3", 72))
val xdf = sqlContext.createDataFrame(x).toDF("name", "cnt")

// UDF that checks membership in the broadcast array
val func: (String => Boolean) = (arg: String) => bc.value.contains(arg)
val sqlfunc = udf(func)
val filtered = xdf.filter(sqlfunc(col("name")))

xdf.show()
filtered.show()
Output of xdf.show():
name    cnt
login1  192
login2  146
login3  72
Output of filtered.show():
name    cnt
login3  72
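If you want the registerTempTable route from the question instead, a minimal sketch on the same data (the table and UDF names here are made up; assuming the Spark 1.4-era SQLContext API):

// Register the UDF for use in SQL and expose the DataFrame as a temp table
sqlContext.udf.register("inSeq", (arg: String) => bc.value.contains(arg))
xdf.registerTempTable("logins")
val filteredSql = sqlContext.sql("SELECT name, cnt FROM logins WHERE inSeq(name)")
filteredSql.show()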