
Spark: Join dataframe column with an array

I have two DataFrames with two columns

  • df1 with schema (key1:Long, Value)

  • df2 with schema (key2:Array[Long], Value)

I need to join these DataFrames on the key columns (find matching values between key1 and the values in key2). The problem is that they don't have the same type. Is there a way to do this?
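For example (a minimal sketch; the sample values below are just placeholders):

import spark.implicits._

// Hypothetical sample data matching the two schemas described above.
val df1 = Seq((1L, "a"), (2L, "b")).toDF("key1", "Value")                    // key1: Long
val df2 = Seq((Array(1L, 3L), "x"), (Array(2L), "y")).toDF("key2", "Value")  // key2: Array[Long]

df1.printSchema()   // key1: long, Value: string
df2.printSchema()   // key2: array<long>, Value: string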

asked Jan 11 '17 by syl


People also ask

How do I combine columns in Spark data frame?

Spark SQL provides the concat() function to concatenate two or more DataFrame columns into a single column. It can also take columns of different data types and concatenate them into one column; for example, it supports String, Int, Boolean, and array columns.
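For illustration, a minimal Scala sketch of concat() (the column names below are made up for the example):

import org.apache.spark.sql.functions.{concat, lit}
import spark.implicits._

// Concatenate two hypothetical string columns with a space in between.
val people = Seq(("Jane", "Doe"), ("John", "Smith")).toDF("first", "last")
people.select(concat($"first", lit(" "), $"last").as("full_name")).show()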

How do I join multiple sparks in DataFrame?

To join multiple tables, you can chain inner joins. The inner join is the default join in Spark and the most commonly used: it joins two DataFrames/Datasets on key columns, and rows whose keys don't match are dropped from both datasets.
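A rough sketch of chaining inner joins across three hypothetical DataFrames (table and column names are assumptions):

import spark.implicits._

// Rows without a matching "id" in every DataFrame are dropped by the inner join.
val orders    = Seq((1, "book"), (2, "pen")).toDF("id", "item")
val customers = Seq((1, "Ann"), (2, "Bob"), (3, "Cid")).toDF("id", "name")
val payments  = Seq((1, 9.99), (2, 1.50)).toDF("id", "amount")

orders
  .join(customers, Seq("id"))   // inner join is the default join type
  .join(payments, Seq("id"))
  .show()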

How do you join columns in PySpark?

PySpark's concat() function (from pyspark.sql.functions) concatenates multiple DataFrame columns into a single column. It can be used with string, binary, and compatible array columns.


2 Answers

The best way to do this (and the one that doesn't require any casting or exploding of DataFrames) is to use the array_contains Spark SQL expression, as shown below.

import org.apache.spark.sql.functions.expr
import spark.implicits._

val df1 = Seq((1L,"one.df1"), (2L,"two.df1"),(3L,"three.df1")).toDF("key1","Value")

val df2 = Seq((Array(1L,1L),"one.df2"), (Array(2L,2L),"two.df2"), (Array(3L,3L),"three.df2")).toDF("key2","Value")

// array_contains(key2, key1) is true when the key2 array holds the key1 value.
df1.join(df2, expr("array_contains(key2, key1)")).show()

+----+---------+------+---------+
|key1|    Value|  key2|    Value|
+----+---------+------+---------+
|   1|  one.df1|[1, 1]|  one.df2|
|   2|  two.df1|[2, 2]|  two.df2|
|   3|three.df1|[3, 3]|three.df2|
+----+---------+------+---------+

Please note that you cannot use the org.apache.spark.sql.functions.array_contains function directly as it requires the second argument to be a literal as opposed to a column expression.
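If an explicit equi-join is preferred, an alternative sketch is to explode the array and join on plain column equality (this duplicates rows when key2 contains repeated elements, so deduplicate afterwards if needed):

import org.apache.spark.sql.functions.explode

// One row per element of key2, joined on equality, helper column dropped afterwards.
val joinedDF = df1
  .join(df2.withColumn("key2_elem", explode($"key2")), $"key1" === $"key2_elem")
  .drop("key2_elem")
  .dropDuplicates()   // guard against repeated elements inside key2

joinedDF.show()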

answered by randal25


You can cast key1 and key2 to strings and then use the contains function, as follows.

val df1 = sc.parallelize(Seq((1L,"one.df1"), 
                             (2L,"two.df1"),      
                             (3L,"three.df1"))).toDF("key1","Value")  

DF1:
+----+---------+
|key1|Value    |
+----+---------+
|1   |one.df1  |
|2   |two.df1  |
|3   |three.df1|
+----+---------+

val df2 = sc.parallelize(Seq((Array(1L,1L),"one.df2"),
                             (Array(2L,2L),"two.df2"),
                             (Array(3L,3L),"three.df2"))).toDF("key2","Value")
DF2:
+------+---------+
|key2  |Value    |
+------+---------+
|[1, 1]|one.df2  |
|[2, 2]|two.df2  |
|[3, 3]|three.df2|
+------+---------+

import org.apache.spark.sql.functions.col
val joinedDF = df1.join(df2, col("key2").cast("string").contains(col("key1").cast("string")))

JOIN:
+----+---------+------+---------+
|key1|Value    |key2  |Value    |
+----+---------+------+---------+
|1   |one.df1  |[1, 1]|one.df2  |
|2   |two.df1  |[2, 2]|two.df2  |
|3   |three.df1|[3, 3]|three.df2|
+----+---------+------+---------+
answered by pheeleeppoo