How do you perform basic joins of two RDD tables in Spark using Python?

How would you perform basic joins in Spark using Python? In R you could use merge() to do this. What is the syntax in Python on Spark for:

  1. Inner Join
  2. Left Outer Join
  3. Cross Join

Assume two tables (RDDs), each with a single value column and a common key:

RDD(1):(key,U)
RDD(2):(key,V)

I think an inner join is something like this:

rdd1.join(rdd2).map(case (key, u, v) => (key, ls ++ rs));

Is that right? I have searched the internet and can't find a good example of joins. Thanks in advance.

asked Jul 06 '15 by invoketheshell

1 Answer

It can be done either using PairRDDFunctions or Spark DataFrames. Since DataFrame operations benefit from the Catalyst optimizer, the second option is worth considering.

Assuming your data looks as follows:

rdd1 =  sc.parallelize([("foo", 1), ("bar", 2), ("baz", 3)])
rdd2 =  sc.parallelize([("foo", 4), ("bar", 5), ("bar", 6)])

With PairRDDs:

Inner join:

rdd1.join(rdd2)
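
For the sample data above, matching keys pair up value-by-value, and bar appears twice because rdd2 has two bar entries (output order may vary):

rdd1.join(rdd2).collect()
## [('foo', (1, 4)), ('bar', (2, 5)), ('bar', (2, 6))]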

Left outer join:

rdd1.leftOuterJoin(rdd2)
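
Here baz has no match in rdd2, so its right-hand value comes back as None (output order may vary):

rdd1.leftOuterJoin(rdd2).collect()
## [('foo', (1, 4)), ('bar', (2, 5)), ('bar', (2, 6)), ('baz', (3, None))]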

Cartesian product (doesn't require RDD[(T, U)]):

rdd1.cartesian(rdd2)
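
Every element of rdd1 is paired with every element of rdd2, keys included, so each result element looks like (('foo', 1), ('bar', 5)) and the count is the product of the two sizes:

rdd1.cartesian(rdd2).count()
## 9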

Broadcast join (doesn't require RDD[(T, U)]; a sketch of the idea follows the link below):

  • see Spark: what's the best strategy for joining a 2-tuple-key RDD with single-key RDD?
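
A minimal sketch of the idea from that answer, assuming the rdd1/rdd2 defined above and that rdd2 is small enough to collect on the driver (the lookup name is just for illustration):

# Collect the smaller RDD as a local multimap and broadcast it to the executors
lookup = sc.broadcast(rdd2.groupByKey().mapValues(list).collectAsMap())

# Map-side inner join: rdd1 is never shuffled
rdd1.flatMap(lambda kv: [(kv[0], (kv[1], v)) for v in lookup.value.get(kv[0], [])])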

Finally, there is cogroup, which has no direct SQL equivalent but can be useful in some situations:

cogrouped = rdd1.cogroup(rdd2)

cogrouped.mapValues(lambda x: (list(x[0]), list(x[1]))).collect()
## [('foo', ([1], [4])), ('bar', ([2], [5, 6])), ('baz', ([3], []))]
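
The grouped values can also be flattened to rebuild the other joins by hand; for instance, the inner join falls out by pairing the two sides (each grouped iterable can be traversed more than once):

cogrouped.flatMap(lambda kv: [(kv[0], (u, v)) for u in kv[1][0] for v in kv[1][1]]).collect()
## [('foo', (1, 4)), ('bar', (2, 5)), ('bar', (2, 6))]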

With Spark DataFrames

You can use either the DataFrame DSL or execute raw SQL using spark.sql.

df1 = spark.createDataFrame(rdd1, ('k', 'v1'))
df2 = spark.createDataFrame(rdd2, ('k', 'v2'))

# Register temporary views to be able to use `spark.sql`
df1.createOrReplaceTempView('df1')
df2.createOrReplaceTempView('df2')

Inner join:

# 'inner' is the default value, so it could be omitted
df1.join(df2, df1.k == df2.k, how='inner')
spark.sql('SELECT * FROM df1 JOIN df2 ON df1.k = df2.k')
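
Note that joining on an expression keeps both k columns in the output; passing the column name instead deduplicates the key:

df1.join(df2, 'k')  # inner by default; result has a single `k` column plus v1 and v2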

Left outer join:

df1.join(df2, df1.k == df2.k, how='left_outer')
spark.sql('SELECT * FROM df1 LEFT OUTER JOIN df2 ON df1.k = df2.k')

Cross join (in Spark 2.x this requires either an explicit cross join or setting spark.sql.crossJoin.enabled to allow implicit Cartesian products):

df1.crossJoin(df2)
spark.sql('SELECT * FROM df1 CROSS JOIN df2')

df1.join(df2)
spark.sql('SELECT * FROM df1 JOIN df2')

Since 1.6 (1.5 in Scala), each of these can be combined with the broadcast function:

from pyspark.sql.functions import broadcast

df1.join(broadcast(df2), df1.k == df2.k)

to perform a broadcast join. See also Why my BroadcastHashJoin is slower than ShuffledHashJoin in Spark.
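
To confirm that the hint took effect, explain() prints the query plan, whose physical plan should contain a broadcast hash join:

df1.join(broadcast(df2), df1.k == df2.k).explain()
## the physical plan should include a BroadcastHashJoin operator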

answered Oct 26 '22 by zero323