Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to join on multiple columns in Pyspark?

I am using Spark 1.3 and would like to join on multiple columns using python interface (SparkSQL)

The following works:

I first register them as temp tables.

numeric.registerTempTable("numeric") Ref.registerTempTable("Ref")  test  = numeric.join(Ref, numeric.ID == Ref.ID, joinType='inner') 

I would now like to join them based on multiple columns.

I get SyntaxError: invalid syntax with this:

test  = numeric.join(Ref,    numeric.ID == Ref.ID AND numeric.TYPE == Ref.TYPE AND    numeric.STATUS == Ref.STATUS ,  joinType='inner') 
like image 424
user3803714 Avatar asked Nov 16 '15 22:11

user3803714


People also ask

How do I join multiple conditions in Pyspark?

join(other, on=None, how=None) Joins with another DataFrame, using the given join expression. The following performs a full outer join between df1 and df2. Parameters: other – Right side of the join on – a string for join column name, a list of column names, , a join expression (Column) or a list of Columns.

How do you join columns in Pyspark?

Concatenating columns in pyspark is accomplished using concat() Function. Concatenating two columns is accomplished using concat() Function. Concatenating multiple columns is accomplished using concat() Function. Concatenating columns in pyspark is accomplished using concat() Function.

How do I join two Dataframes in Pyspark based on column?

Summary: Pyspark DataFrames have a join method which takes three parameters: DataFrame on the right side of the join, Which fields are being joined on, and what type of join (inner, outer, left_outer, right_outer, leftsemi). You call the join method from the left side DataFrame object such as df1. join(df2, df1.


1 Answers

You should use & / | operators and be careful about operator precedence (== has lower precedence than bitwise AND and OR):

df1 = sqlContext.createDataFrame(     [(1, "a", 2.0), (2, "b", 3.0), (3, "c", 3.0)],     ("x1", "x2", "x3"))  df2 = sqlContext.createDataFrame(     [(1, "f", -1.0), (2, "b", 0.0)], ("x1", "x2", "x3"))  df = df1.join(df2, (df1.x1 == df2.x1) & (df1.x2 == df2.x2)) df.show()  ## +---+---+---+---+---+---+ ## | x1| x2| x3| x1| x2| x3| ## +---+---+---+---+---+---+ ## |  2|  b|3.0|  2|  b|0.0| ## +---+---+---+---+---+---+ 
like image 151
zero323 Avatar answered Oct 11 '22 21:10

zero323