How to give more column conditions when joining two dataframes. For example I want to run the following :
val Lead_all = Leads.join(Utm_Master, Leaddetails.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign") == Utm_Master.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign"), "left")
I want to join only when these columns match. But above syntax is not valid as cols only takes one string. So how do I get what I want.
join(other, on=None, how=None) Joins with another DataFrame, using the given join expression. The following performs a full outer join between df1 and df2. Parameters: other – Right side of the join on – a string for join column name, a list of column names, , a join expression (Column) or a list of Columns.
In order to explain join with multiple tables, we will use Inner join, this is the default join in Spark and it's mostly used, this joins two DataFrames/Datasets on key columns, and where keys don't match the rows get dropped from both datasets.
Here In first dataframe (dataframe1) , the columns ['ID', 'NAME', 'Address'] and second dataframe (dataframe2 ) columns are ['ID','Age']. Now we have to add the Age column to the first dataframe and NAME and Address in the second dataframe, we can do this by using lit() function. This function is available in pyspark.
There is a Spark column/expression API join for such case:
Leaddetails.join( Utm_Master, Leaddetails("LeadSource") <=> Utm_Master("LeadSource") && Leaddetails("Utm_Source") <=> Utm_Master("Utm_Source") && Leaddetails("Utm_Medium") <=> Utm_Master("Utm_Medium") && Leaddetails("Utm_Campaign") <=> Utm_Master("Utm_Campaign"), "left" )
The <=>
operator in the example means "Equality test that is safe for null values".
The main difference with simple Equality test (===
) is that the first one is safe to use in case one of the columns may have null values.
As of Spark version 1.5.0 (which is currently unreleased), you can join on multiple DataFrame columns. Refer to SPARK-7990: Add methods to facilitate equi-join on multiple join keys.
Python
Leads.join( Utm_Master, ["LeadSource","Utm_Source","Utm_Medium","Utm_Campaign"], "left_outer" )
Scala
The question asked for a Scala answer, but I don't use Scala. Here is my best guess....
Leads.join( Utm_Master, Seq("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign"), "left_outer" )
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With