Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

pyspark join multiple conditions

Tags:

How I can specify lot of conditions in pyspark when I use .join()

Example : with hive :

query= "select a.NUMCNT,b.NUMCNT as RNUMCNT ,a.POLE,b.POLE as RPOLE,a.ACTIVITE,b.ACTIVITE as RACTIVITE FROM rapexp201412 b \     join rapexp201412 a where (a.NUMCNT=b.NUMCNT and a.ACTIVITE = b.ACTIVITE and a.POLE =b.POLE  )\ 

But in PySpark I don't know how to make it because the following:

df_rapexp201412.join(df_aeveh,df_rapexp2014.ACTIVITE==df_rapexp2014.ACTIVITE and df_rapexp2014.POLE==df_aeveh.POLE,'inner') 

It does not work!!

like image 226
malouke Avatar asked Dec 02 '15 11:12

malouke


People also ask

How do I join two Pyspark data frames?

Summary: Pyspark DataFrames have a join method which takes three parameters: DataFrame on the right side of the join, Which fields are being joined on, and what type of join (inner, outer, left_outer, right_outer, leftsemi). You call the join method from the left side DataFrame object such as df1. join(df2, df1.

How do you join two columns in Pyspark?

Concatenating columns in pyspark is accomplished using concat() Function. Concatenating two columns is accomplished using concat() Function. Concatenating multiple columns is accomplished using concat() Function. Concatenating columns in pyspark is accomplished using concat() Function.


1 Answers

Quoting from spark docs:

(https://spark.apache.org/docs/1.5.2/api/python/pyspark.sql.html?highlight=dataframe%20join#pyspark.sql.DataFrame.join)

join(other, on=None, how=None) Joins with another DataFrame, using the given join expression.

The following performs a full outer join between df1 and df2.

Parameters: other – Right side of the join on – a string for join column name, a list of column names, , a join expression (Column) or a list of Columns. If on is a string or a list of string indicating the name of the join column(s), the column(s) must exist on both sides, and this performs an inner equi-join. how – str, default ‘inner’. One of inner, outer, left_outer, right_outer, semijoin.

>>> df.join(df2, df.name == df2.name, 'outer').select(df.name, df2.height).collect()  [Row(name=None, height=80), Row(name=u'Alice', height=None), Row(name=u'Bob', height=85)]   >>> cond = [df.name == df3.name, df.age == df3.age] >>> df.join(df3, cond, 'outer').select(df.name, df3.age).collect() [Row(name=u'Bob', age=5), Row(name=u'Alice', age=2)] 

So you need to use the "condition as a list" option like in the last example.

like image 115
user3689574 Avatar answered Sep 25 '22 07:09

user3689574