Spark: specify multiple column conditions for DataFrame join

How do I specify multiple column conditions when joining two DataFrames? For example, I want to run the following:

    val Lead_all = Leads.join(
        Utm_Master,
        Leaddetails.columns("LeadSource", "Utm_Source", "Utm_Medium", "Utm_Campaign") ==
            Utm_Master.columns("LeadSource", "Utm_Source", "Utm_Medium", "Utm_Campaign"),
        "left"
    )

I want to join only when these columns match. But the above syntax is not valid, as cols only takes one string. So how do I get what I want?

asked Jul 06 '15 by user568109

People also ask

How do I join multiple conditions in PySpark?

join(other, on=None, how=None) joins with another DataFrame, using the given join expression. Parameters: other – right side of the join; on – a string for the join column name, a list of column names, a join expression (Column), or a list of Columns. The following performs a full outer join between df1 and df2.
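Written here in Scala (the PySpark call takes the same arguments), a minimal sketch of that full outer join between two hypothetical DataFrames df1 and df2 that share a name column:

    // Hypothetical df1 and df2, each with a "name" column.
    // "full_outer" keeps unmatched rows from both sides, filling gaps with null.
    val fullJoin = df1.join(df2, df1("name") === df2("name"), "full_outer")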

How do I join multiple sparks in DataFrame?

To explain joining multiple tables, we will use an inner join. This is the default join in Spark and the one most commonly used: it joins two DataFrames/Datasets on key columns, and rows whose keys don't match are dropped from both datasets.
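A minimal sketch of that behavior (the orders and customers frames and the customerId key are hypothetical):

    // Equi-join on the key column; inner is the default join type.
    // Orders with no matching customer, and customers with no orders,
    // are dropped from the result.
    val matched = orders.join(customers, Seq("customerId"))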

How do I merge two DataFrames with different columns in Spark?

Here the first DataFrame (dataframe1) has the columns ['ID', 'NAME', 'Address'], and the second DataFrame (dataframe2) has the columns ['ID', 'Age']. Now we have to add the Age column to the first DataFrame, and NAME and Address to the second; we can do this using the lit() function. This function is available in pyspark.
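A rough Scala version of that idea (lit is also available in org.apache.spark.sql.functions; dataframe1 and dataframe2 follow the shapes described above):

    import org.apache.spark.sql.functions.lit

    // dataframe1: ["ID", "NAME", "Address"]; dataframe2: ["ID", "Age"].
    // lit(null) fills in the columns each side is missing.
    val left = dataframe1.withColumn("Age", lit(null).cast("int"))
    val right = dataframe2
      .withColumn("NAME", lit(null).cast("string"))
      .withColumn("Address", lit(null).cast("string"))

    // Align column order, then merge (use unionAll on Spark 1.x).
    val merged = left.select("ID", "NAME", "Address", "Age")
      .union(right.select("ID", "NAME", "Address", "Age"))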


2 Answers

There is a Spark column/expression API for exactly this case:

    Leaddetails.join(
        Utm_Master,
        Leaddetails("LeadSource") <=> Utm_Master("LeadSource")
            && Leaddetails("Utm_Source") <=> Utm_Master("Utm_Source")
            && Leaddetails("Utm_Medium") <=> Utm_Master("Utm_Medium")
            && Leaddetails("Utm_Campaign") <=> Utm_Master("Utm_Campaign"),
        "left"
    )

The <=> operator in the example means "equality test that is safe for null values".

The main difference from the simple equality test (===) is that <=> is safe to use when one of the columns may have null values.
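A small sketch of the difference, assuming Utm_Source is nullable on both sides:

    // With ===, null compared to null yields null, so rows where both
    // sides have a null Utm_Source do NOT match:
    Leaddetails("Utm_Source") === Utm_Master("Utm_Source")

    // With <=>, null <=> null evaluates to true, so those rows DO match:
    Leaddetails("Utm_Source") <=> Utm_Master("Utm_Source")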

answered by rchukh


As of Spark version 1.5.0 (which is currently unreleased), you can join on multiple DataFrame columns. Refer to SPARK-7990: Add methods to facilitate equi-join on multiple join keys.

Python

    Leads.join(
        Utm_Master,
        ["LeadSource", "Utm_Source", "Utm_Medium", "Utm_Campaign"],
        "left_outer"
    )

Scala

The question asked for a Scala answer, but I don't use Scala. Here is my best guess...

    Leads.join(
        Utm_Master,
        Seq("LeadSource", "Utm_Source", "Utm_Medium", "Utm_Campaign"),
        "left_outer"
    )
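One practical difference from the expression-based join in the first answer: when you join on a Seq of column names, the result contains a single copy of each join column rather than one from each side.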
answered by dnlbrky