I am using Spark 1.3 and would like to join on multiple columns using the Python interface (SparkSQL).
The following works. I first register them as temp tables:

numeric.registerTempTable("numeric")
Ref.registerTempTable("Ref")

test = numeric.join(Ref, numeric.ID == Ref.ID, joinType='inner')
I would now like to join them based on multiple columns.
I get a SyntaxError: invalid syntax with this:

test = numeric.join(Ref, numeric.ID == Ref.ID AND numeric.TYPE == Ref.TYPE AND numeric.STATUS == Ref.STATUS, joinType='inner')
join(other, on=None, how=None)
Joins with another DataFrame, using the given join expression. The following performs a full outer join between df1 and df2.
Parameters:
other – Right side of the join.
on – a string for the join column name, a list of column names, a join expression (Column), or a list of Columns.
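As a minimal sketch of those forms of the on parameter (df_a, df_b and their columns are made up here; note that the quoted signature is from a newer release than 1.3, and as far as I know the list forms are not available in Spark 1.3, where the condition must be a single Column expression):

df_a = sqlContext.createDataFrame([(1, "a"), (2, "b")], ("x1", "x2"))
df_b = sqlContext.createDataFrame([(2, "b"), (3, "c")], ("x1", "x2"))

# A single join expression (Column), here with a full outer join
df_a.join(df_b, df_a.x1 == df_b.x1, "outer")

# A list of Columns - the conditions are combined with AND
df_a.join(df_b, [df_a.x1 == df_b.x1, df_a.x2 == df_b.x2], "inner")

# A list of column names - an equi-join on those columns
# (newer Spark versions only)
df_a.join(df_b, ["x1", "x2"], "inner")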
Summary: PySpark DataFrames have a join method which takes three parameters: the DataFrame on the right side of the join, which fields are being joined on, and what type of join (inner, outer, left_outer, right_outer, leftsemi). You call the join method from the left-side DataFrame object, e.g. df1.join(df2, ...).
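A sketch of the third parameter, the join type, using two toy DataFrames invented for illustration (the exact set of accepted type names can vary between Spark versions):

people = sqlContext.createDataFrame([(1, "Alice"), (2, "Bob")], ("id", "name"))
orders = sqlContext.createDataFrame([(1, 100.0)], ("id", "amount"))

cond = people.id == orders.id

people.join(orders, cond, 'inner')        # only rows with a match on both sides
people.join(orders, cond, 'left_outer')   # keep all rows from people
people.join(orders, cond, 'right_outer')  # keep all rows from orders
people.join(orders, cond, 'outer')        # keep all rows from both sides
people.join(orders, cond, 'leftsemi')     # rows from people that have a match, people's columns only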
You should use the & / | operators and be careful about operator precedence (== has lower precedence than bitwise AND and OR):
df1 = sqlContext.createDataFrame(
    [(1, "a", 2.0), (2, "b", 3.0), (3, "c", 3.0)],
    ("x1", "x2", "x3"))

df2 = sqlContext.createDataFrame(
    [(1, "f", -1.0), (2, "b", 0.0)],
    ("x1", "x2", "x3"))

df = df1.join(df2, (df1.x1 == df2.x1) & (df1.x2 == df2.x2))
df.show()

## +---+---+---+---+---+---+
## | x1| x2| x3| x1| x2| x3|
## +---+---+---+---+---+---+
## |  2|  b|3.0|  2|  b|0.0|
## +---+---+---+---+---+---+
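Applied to the original question (keeping the column names ID, TYPE, STATUS and the Spark 1.3 joinType keyword from the question), the failing AND-based expression would become something like:

test = numeric.join(
    Ref,
    (numeric.ID == Ref.ID) &
    (numeric.TYPE == Ref.TYPE) &
    (numeric.STATUS == Ref.STATUS),
    joinType='inner')

Note that, as in the sample output above, both copies of the join columns are kept in the result, so you may want to select only the columns you need afterwards.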