I want to perform a join between these two PySpark DataFrames:
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
from pyspark.sql.functions import col

sc = SparkContext()
sqlContext = SQLContext(sc)  # needed so that RDD.toDF is available

df1 = sc.parallelize([
    ['owner1', 'obj1', 0.5],
    ['owner1', 'obj1', 0.2],
    ['owner2', 'obj2', 0.1]
]).toDF(('owner', 'object', 'score'))

df2 = sc.parallelize([
    Row(owner=u'owner1',
        objects=[Row(name=u'obj1', value=Row(fav=True, ratio=0.3))])
]).toDF()
The join has to be performed on the name of the object, namely the field name inside objects for df2 and object for df1.
I am able to perform a SELECT on the nested field, as in
df2.where(df2.owner == 'owner1').select(col("objects.value.ratio")).show()
but I am not able to run this join:
df2.alias('u').join(df1.alias('s'), col('u.objects.name') == col('s.object'))
which returns the error:
pyspark.sql.utils.AnalysisException: u"cannot resolve '(objects.name = cast(object as double))' due to data type mismatch: differing types in '(objects.name = cast(object as double))' (array and double).;"
Any ideas how to solve this?
Since you want to match and extract a specific element, the simplest approach is to explode the row:
from pyspark.sql.functions import col, explode

matches = df2.withColumn("object", explode(col("objects"))).alias("u").join(
    df1.alias("s"),
    col("s.object") == col("u.object.name")
)
matches.show()
## +-------------------+------+-----------------+------+------+-----+
## | objects| owner| object| owner|object|score|
## +-------------------+------+-----------------+------+------+-----+
## |[[obj1,[true,0.3]]]|owner1|[obj1,[true,0.3]]|owner1| obj1| 0.5|
## |[[obj1,[true,0.3]]]|owner1|[obj1,[true,0.3]]|owner1| obj1| 0.2|
## +-------------------+------+-----------------+------+------+-----+
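Because both sides contribute owner and object columns, the joined result carries duplicate column names. As a possible follow-up (a minimal sketch, not part of the original answer), you can select and rename just the fields you need through the aliases:

matches.select(
    col("s.owner").alias("owner"),
    col("s.object").alias("object"),
    col("u.object.value.ratio").alias("ratio"),
    col("s.score").alias("score")
).show()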
An alternative, but very inefficient, approach is to use array_contains:
from pyspark.sql.functions import expr

matches_contains = df1.alias("s").join(
    df2.alias("u"), expr("array_contains(objects.name, object)"))
It is inefficient because it will be expanded into a Cartesian product:
matches_contains.explain()
## == Physical Plan ==
## Filter array_contains(objects#6.name,object#4)
## +- CartesianProduct
## :- Scan ExistingRDD[owner#3,object#4,score#5]
## +- Scan ExistingRDD[objects#6,owner#7]
If the size of the array is relatively small, it is possible to generate an optimized version of array_contains, as I've shown here: Filter by whether column value equals a list in spark
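For completeness, here is a minimal sketch of that idea (not taken verbatim from the linked answer): assuming the objects array never holds more than a known number of elements (MAX_OBJECTS below is a hypothetical bound), the array_contains call can be unrolled into an explicit OR of per-position equality checks. Whether this actually helps depends on your data and the resulting plan:

from functools import reduce
from operator import or_
from pyspark.sql.functions import col

# Hypothetical upper bound on the length of the objects array.
MAX_OBJECTS = 3

# Unroll array_contains(objects.name, object) into an explicit disjunction
# of per-position comparisons; out-of-range positions evaluate to NULL,
# which behaves like false inside the join condition.
unrolled = reduce(
    or_,
    [col("u.objects")[i]["name"] == col("s.object") for i in range(MAX_OBJECTS)]
)

matches_unrolled = df1.alias("s").join(df2.alias("u"), unrolled)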