Joining Spark dataframes on the key

1 Answer

Alias approach using Scala (this example was written for an older version of Spark; for Spark 2.x, see my other answer):

You can use case classes to prepare a sample dataset. This is optional; for example, you can also get a DataFrame from hiveContext.sql.

    import org.apache.spark.sql.functions.col

    case class Person(name: String, age: Int, personid: Int)
    case class Profile(name: String, personid: Int, profileDescription: String)

    val df1 = sqlContext.createDataFrame(
      Person("Bindu", 20, 2) :: Person("Raphel", 25, 5) :: Person("Ram", 40, 9) :: Nil)
    val df2 = sqlContext.createDataFrame(
      Profile("Spark", 2, "SparkSQLMaster") :: Profile("Spark", 5, "SparkGuru") :: Profile("Spark", 9, "DevHunter") :: Nil)

    // Alias each DataFrame so columns can be referenced by a qualified name, which improves readability
    val df_asPerson = df1.as("dfperson")
    val df_asProfile = df2.as("dfprofile")

    val joined_df = df_asPerson.join(
      df_asProfile,
      col("dfperson.personid") === col("dfprofile.personid"),
      "inner")

    joined_df.select(
      col("dfperson.name"),
      col("dfperson.age"),
      col("dfprofile.name"),
      col("dfprofile.profileDescription"))
      .show

A sample temp table approach, which I personally don't like:

The reason to use the registerTempTable(tableName) method on a DataFrame is that, in addition to the Spark-provided DataFrame methods, you can also issue SQL queries via sqlContext.sql(sqlQuery) that use that DataFrame as an SQL table. The tableName parameter specifies the name to use for that DataFrame in the SQL queries.

    df_asPerson.registerTempTable("dfperson")
    df_asProfile.registerTempTable("dfprofile")

    sqlContext.sql("""SELECT dfperson.name, dfperson.age, dfprofile.profileDescription
                      FROM dfperson JOIN dfprofile
                      ON dfperson.personid == dfprofile.personid""")
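For readers on Spark 2.x or later, registerTempTable is deprecated in favour of createOrReplaceTempView, and SparkSession takes the place of sqlContext. The following is a minimal sketch of the same temp table idea under those assumptions; the local master setting, the app name, and the variable names persons, profiles and joined are illustrative and not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int, personid: Int)
case class Profile(name: String, personid: Int, profileDescription: String)

// Assumption: a local session for experimentation; in spark-shell,
// getOrCreate() simply returns the session that already exists.
val spark = SparkSession.builder()
  .appName("temp-view-join")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val persons = Seq(Person("Bindu", 20, 2), Person("Raphel", 25, 5), Person("Ram", 40, 9)).toDF()
val profiles = Seq(
  Profile("Spark", 2, "SparkSQLMaster"),
  Profile("Spark", 5, "SparkGuru"),
  Profile("Spark", 9, "DevHunter")).toDF()

// Register both DataFrames as session-scoped views so SQL can refer to them by name
persons.createOrReplaceTempView("dfperson")
profiles.createOrReplaceTempView("dfprofile")

val joined = spark.sql(
  """SELECT dfperson.name, dfperson.age, dfprofile.profileDescription
    |FROM dfperson JOIN dfprofile
    |ON dfperson.personid = dfprofile.personid""".stripMargin)

joined.show()
```

The views live only as long as the session that created them, so nothing is written to the metastore.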

If you want to know more about joins, please see this nice post: beyond-traditional-join-with-apache-spark


Note:

1) As mentioned by @RaphaelRoth, val resultDf = PersonDf.join(ProfileDf, Seq("personId")) is a good approach, since the result doesn't contain duplicate join-key columns from both sides when you do an inner join on a column of the same name.
2) A Spark 2.x example is given in another answer, covering the full set of join operations supported by Spark 2.x, with examples and results.
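To illustrate note 1, here is a small sketch that compares the two join styles on the sample data. It assumes Spark 2.x or later with a local SparkSession; the names persons, profiles, joinedOnKey and joinedOnExpr are made up for this example. Only the join key is de-duplicated: the name column exists on both sides and is not a key, so it still appears twice.

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int, personid: Int)
case class Profile(name: String, personid: Int, profileDescription: String)

// Assumption: local session for experimentation
val spark = SparkSession.builder()
  .appName("join-on-column-names")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val persons = Seq(Person("Bindu", 20, 2), Person("Raphel", 25, 5), Person("Ram", 40, 9)).toDF()
val profiles = Seq(
  Profile("Spark", 2, "SparkSQLMaster"),
  Profile("Spark", 5, "SparkGuru"),
  Profile("Spark", 9, "DevHunter")).toDF()

// Joining on a Seq of column names keeps a single copy of the key column
val joinedOnKey = persons.join(profiles, Seq("personid"))

// Joining on a column expression keeps the key column from both sides
val joinedOnExpr = persons.join(profiles, persons("personid") === profiles("personid"))

joinedOnKey.printSchema()
joinedOnExpr.printSchema()
```

With the expression form, a later select("personid") is ambiguous, which is why the alias approach above qualifies every column.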

TIP:

Another important thing in joins: the broadcast function can be used to give a join hint; please see my answer.
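As a rough sketch of the tip, the broadcast function from org.apache.spark.sql.functions marks one side of the join as small enough to ship to every executor. It is a hint rather than a guarantee, so the plan printed by explain is worth checking. The local SparkSession and the names persons, profiles and joined are assumptions made for this example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

case class Person(name: String, age: Int, personid: Int)
case class Profile(name: String, personid: Int, profileDescription: String)

// Assumption: local session for experimentation
val spark = SparkSession.builder()
  .appName("broadcast-join-hint")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val persons = Seq(Person("Bindu", 20, 2), Person("Raphel", 25, 5), Person("Ram", 40, 9)).toDF()
val profiles = Seq(
  Profile("Spark", 2, "SparkSQLMaster"),
  Profile("Spark", 5, "SparkGuru"),
  Profile("Spark", 9, "DevHunter")).toDF()

// Hint that profiles is the small side and can be broadcast to all executors
val joined = persons.join(broadcast(profiles), Seq("personid"), "inner")

// Print the physical plan to see which join strategy was actually chosen
joined.explain()
joined.show()
```

The hint is most useful when one side is far smaller than the other, since it avoids shuffling the large side.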


answered Sep 22 '22 19:09

Ram Ghadiyaram


Tags:

dataframe

scala

apache-spark

apache-spark-sql

Bindumalini KK
