I need to join a DataFrame that has a string column to one that has an array-of-string column, so that whenever one of the values in the array matches, the rows join.
I tried this, but I guess it's not supported. Any other way to do this?
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val sparkConf = new SparkConf().setMaster("local[*]").setAppName("test")
val spark = SparkSession.builder().config(sparkConf).getOrCreate()
import spark.implicits._

val left = spark.sparkContext.parallelize(Seq(1, 2, 3)).toDF("col1")
val right = spark.sparkContext.parallelize(Seq((Array(1, 2), "Yes"), (Array(3), "No"))).toDF("col1", "col2")

// Equi-join on "col1" fails: left's col1 is an int, right's col1 is an array
left.join(right, "col1")
Throws:
org.apache.spark.sql.AnalysisException: cannot resolve '(col1 = col1)' due to data type mismatch: differing types in '(col1 = col1)' (int and array).;;
The most succinct way to do this is to use the array_contains Spark SQL expression, as shown below. That said, I've compared its performance with that of the explode-and-join approach from a previous answer, and the explode seems more performant.
import org.apache.spark.sql.functions.expr
import spark.implicits._

val left = Seq(1, 2, 3).toDF("col1")
val right = Seq((Array(1, 2), "Yes"), (Array(3), "No")).toDF("col1", "col2").withColumnRenamed("col1", "col1_array")

// Join on a SQL expression that checks array membership
val joined = left.join(right, expr("array_contains(col1_array, col1)"))
joined.show
+----+----------+----+
|col1|col1_array|col2|
+----+----------+----+
| 1| [1, 2]| Yes|
| 2| [1, 2]| Yes|
| 3| [3]| No|
+----+----------+----+
Note that you can't use the org.apache.spark.sql.functions.array_contains function directly here, as it requires the second argument to be a literal rather than a column expression.
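For comparison, here is a minimal sketch of the explode-and-join variant mentioned above, reusing left and right from the snippet in this answer; the exploded column's name is an assumption, chosen to match the left-hand key:
import org.apache.spark.sql.functions.{col, explode}

// One row per array element, named "col1" to match the left side's key,
// which turns the membership test into a plain equi-join
val exploded = right.withColumn("col1", explode(col("col1_array")))
left.join(exploded, "col1").show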
One option is to create a UDF for building your join condition:
import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray

val left = spark.sparkContext.parallelize(Seq(1, 2, 3)).toDF("col1")
val right = spark.sparkContext.parallelize(Seq((Array(1, 2), "Yes"), (Array(3), "No"))).toDF("col1", "col2")

// UDF that tests whether the array column contains the scalar value
val checkValue = udf {
  (array: WrappedArray[Int], value: Int) => array.contains(value)
}

val result = left.join(right, checkValue(right("col1"), left("col1")), "inner")
result.show
+----+------+----+
|col1| col1|col2|
+----+------+----+
| 1|[1, 2]| Yes|
| 2|[1, 2]| Yes|
| 3| [3]| No|
+----+------+----+
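As a side note, a sketch assuming a reasonably recent Spark version: Spark hands array columns to Scala UDFs as a Seq, so the parameter can be typed as Seq[Int] and the explicit WrappedArray import dropped. The name checkValueSeq is hypothetical; the behavior matches checkValue above.
// Same membership test, typed against Seq[Int] instead of WrappedArray
val checkValueSeq = udf { (array: Seq[Int], value: Int) => array.contains(value) }
left.join(right, checkValueSeq(right("col1"), left("col1")), "inner").show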