Spark: specify multiple column conditions for DataFrame join

How do I specify multiple column conditions when joining two DataFrames? For example, I want to run the following:

    val Lead_all = Leads.join(
        Utm_Master,
        Leaddetails.columns("LeadSource", "Utm_Source", "Utm_Medium", "Utm_Campaign") ==
            Utm_Master.columns("LeadSource", "Utm_Source", "Utm_Medium", "Utm_Campaign"),
        "left"
    )

I want to join only when these columns match. But the above syntax is not valid, as cols only takes one string. So how do I get what I want?

asked Jul 06 '15 by user568109

People also ask

How do I join multiple conditions in PySpark?

join(other, on=None, how=None) joins with another DataFrame, using the given join expression. Parameters: other – right side of the join; on – a string for the join column name, a list of column names, a join expression (Column), or a list of Columns. The following performs a full outer join between df1 and df2.
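Written here in Scala (the PySpark call takes the same arguments), a minimal sketch of that full outer join between two hypothetical DataFrames df1 and df2 that share a name column:

    // Hypothetical df1 and df2, each with a "name" column.
    // "full_outer" keeps unmatched rows from both sides, filling gaps with null.
    val fullJoin = df1.join(df2, df1("name") === df2("name"), "full_outer")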

How do I join multiple sparks in DataFrame?

To explain joining multiple tables, we will use an inner join. This is the default join in Spark and the one most commonly used: it joins two DataFrames/Datasets on key columns, and rows whose keys don't match are dropped from both datasets.
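A minimal sketch of that behavior (the orders and customers frames and the customerId key are hypothetical):

    // Equi-join on the key column; inner is the default join type.
    // Orders with no matching customer, and customers with no orders,
    // are dropped from the result.
    val matched = orders.join(customers, Seq("customerId"))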

How do I merge two DataFrames with different columns in Spark?

Here the first DataFrame (dataframe1) has the columns ['ID', 'NAME', 'Address'], and the second DataFrame (dataframe2) has the columns ['ID', 'Age']. Now we have to add the Age column to the first DataFrame, and NAME and Address to the second; we can do this using the lit() function. This function is available in pyspark.
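A rough Scala version of that idea (lit is also available in org.apache.spark.sql.functions; dataframe1 and dataframe2 follow the shapes described above):

    import org.apache.spark.sql.functions.lit

    // dataframe1: ["ID", "NAME", "Address"]; dataframe2: ["ID", "Age"].
    // lit(null) fills in the columns each side is missing.
    val left = dataframe1.withColumn("Age", lit(null).cast("int"))
    val right = dataframe2
      .withColumn("NAME", lit(null).cast("string"))
      .withColumn("Address", lit(null).cast("string"))

    // Align column order, then merge (use unionAll on Spark 1.x).
    val merged = left.select("ID", "NAME", "Address", "Age")
      .union(right.select("ID", "NAME", "Address", "Age"))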


2 Answers

There is a Spark column/expression API for exactly this case:

    Leaddetails.join(
        Utm_Master,
        Leaddetails("LeadSource") <=> Utm_Master("LeadSource")
            && Leaddetails("Utm_Source") <=> Utm_Master("Utm_Source")
            && Leaddetails("Utm_Medium") <=> Utm_Master("Utm_Medium")
            && Leaddetails("Utm_Campaign") <=> Utm_Master("Utm_Campaign"),
        "left"
    )

The <=> operator in the example means "equality test that is safe for null values".

The main difference from the simple equality test (===) is that <=> is safe to use when one of the columns may have null values.
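A small sketch of the difference, assuming Utm_Source is nullable on both sides:

    // With ===, null compared to null yields null, so rows where both
    // sides have a null Utm_Source do NOT match:
    Leaddetails("Utm_Source") === Utm_Master("Utm_Source")

    // With <=>, null <=> null evaluates to true, so those rows DO match:
    Leaddetails("Utm_Source") <=> Utm_Master("Utm_Source")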

answered by rchukh


As of Spark version 1.5.0 (which is currently unreleased), you can join on multiple DataFrame columns. Refer to SPARK-7990: Add methods to facilitate equi-join on multiple join keys.

Python

    Leads.join(
        Utm_Master,
        ["LeadSource", "Utm_Source", "Utm_Medium", "Utm_Campaign"],
        "left_outer"
    )

Scala

The question asked for a Scala answer, but I don't use Scala. Here is my best guess...

    Leads.join(
        Utm_Master,
        Seq("LeadSource", "Utm_Source", "Utm_Medium", "Utm_Campaign"),
        "left_outer"
    )
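One practical difference from the expression-based join in the first answer: when you join on a Seq of column names, the result contains a single copy of each join column rather than one from each side.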
answered by dnlbrky