I need to join two ordinary RDDs on one or more columns. Logically this operation is equivalent to a database join of two tables. I wonder whether this is possible only through Spark SQL, or whether there are other ways of doing it.

As a concrete example, consider RDD r1 with primary key ITEM_ID:

(ITEM_ID, ITEM_NAME, ITEM_UNIT, COMPANY_ID)

and RDD r2 with primary key COMPANY_ID:

(COMPANY_ID, COMPANY_NAME, COMPANY_CITY)

I want to join r1 and r2. How can this be done?
Use join: it returns an RDD containing all pairs of elements with matching keys in self and other. Each pair of elements is returned as a (k, (v1, v2)) tuple, where (k, v1) is in self and (k, v2) is in other. It performs a hash join across the cluster.
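For example, here is a minimal sketch of applying join to the r1 and r2 from the question. The sample rows and the tuple layout after keying are my own assumptions, just to make the snippet self-contained:

// Hypothetical sample data matching the schemas in the question
val r1 = sc.parallelize(Seq(
  (1, "pen",   "pcs", "c1"),     // (ITEM_ID, ITEM_NAME, ITEM_UNIT, COMPANY_ID)
  (2, "paper", "box", "c2")
))
val r2 = sc.parallelize(Seq(
  ("c1", "company-1", "city-1"), // (COMPANY_ID, COMPANY_NAME, COMPANY_CITY)
  ("c2", "company-2", "city-2")
))

// Key both RDDs by COMPANY_ID, then join
val r1ByCompany = r1.map { case (id, name, unit, companyId) => (companyId, (id, name, unit)) }
val r2ByCompany = r2.map { case (companyId, cName, cCity) => (companyId, (cName, cCity)) }

// Result rows look like (COMPANY_ID, ((ITEM_ID, ITEM_NAME, ITEM_UNIT), (COMPANY_NAME, COMPANY_CITY)))
val joined = r1ByCompany.join(r2ByCompany)
joined.collect().foreach(println)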
Apache Spark, being an open-source framework for big data, has several advantages over other big data solutions: it is dynamic in nature, it supports in-memory computation of RDDs, and it provides reusability, fault tolerance, real-time stream processing, and more.
Spark paired RDDs are simply RDDs whose elements are key-value pairs. Unpaired RDDs can consist of objects of any type, but paired (key-value) RDDs gain a few special operations, such as distributed "shuffle" operations and grouping or aggregating the elements by their key.
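As a small illustration (the word list here is made up), pairing each element with a key is what unlocks those key-based operations, e.g. reduceByKey:

// An ordinary RDD of strings
val words = sc.parallelize(Seq("spark", "rdd", "spark", "join"))

// Pair each word with a count of 1 to get a key-value (paired) RDD
val pairs = words.map(word => (word, 1))

// Key-based operations are now available, e.g. aggregating by key
val counts = pairs.reduceByKey(_ + _)   // ("spark", 2), ("rdd", 1), ("join", 1)
counts.collect().foreach(println)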
Soumya Simanta gave a good answer. However, the values in the joined RDD are Iterable, so the results may not look exactly like an ordinary table join.
Alternatively, you can:
// Key both RDDs by companyId, then join them
val mappedItems = items.map(item => (item.companyId, item))
val mappedComp = companies.map(comp => (comp.companyId, comp))
mappedItems.join(mappedComp).take(10).foreach(println)
The output would be:
(c1,(Item(1,first,2,c1),Company(c1,company-1,city-1)))
(c1,(Item(2,second,2,c1),Company(c1,company-1,city-1)))
(c2,(Item(3,third,2,c2),Company(c2,company-2,city-2)))
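For completeness, here is one way the items and companies RDDs used above might be built. The Item and Company case classes and their field names are assumptions inferred from the snippet and its output, not something given in the original answer:

case class Item(itemId: Int, itemName: String, itemUnit: Int, companyId: String)
case class Company(companyId: String, companyName: String, companyCity: String)

val items = sc.parallelize(Seq(
  Item(1, "first", 2, "c1"),
  Item(2, "second", 2, "c1"),
  Item(3, "third", 2, "c2")
))
val companies = sc.parallelize(Seq(
  Company("c1", "company-1", "city-1"),
  Company("c2", "company-2", "city-2")
))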
(Using Scala) Let's say you have two RDDs:
emp: (empid, ename, dept)
dept: (dname, dept)
The following is another way to do it:
//val emp = sc.parallelize(Seq((1,"jordan",10), (2,"ricky",20), (3,"matt",30), (4,"mince",35), (5,"rhonda",30)))
val emp = sc.parallelize(Seq(("jordan",10), ("ricky",20), ("matt",30), ("mince",35), ("rhonda",30)))
val dept = sc.parallelize(Seq(("hadoop",10), ("spark",20), ("hive",30), ("sqoop",40)))

//val shifted_fields_emp = emp.map(t => (t._3, t._1, t._2))
val shifted_fields_emp = emp.map(t => (t._2, t._1))
val shifted_fields_dept = dept.map(t => (t._2, t._1))

shifted_fields_emp.join(shifted_fields_dept)

// Alternatively, using keyBy:

// Create emp RDD
val emp = sc.parallelize(Seq((1,"jordan",10), (2,"ricky",20), (3,"matt",30), (4,"mince",35), (5,"rhonda",30)))

// Create dept RDD
val dept = sc.parallelize(Seq(("hadoop",10), ("spark",20), ("hive",30), ("sqoop",40)))

// Establish that the third field is to be considered as the key for the emp RDD
val manipulated_emp = emp.keyBy(t => t._3)

// Establish that the second field is to be considered as the key for the dept RDD
val manipulated_dept = dept.keyBy(t => t._2)

// Inner join
val join_data = manipulated_emp.join(manipulated_dept)

// Left outer join
val left_outer_join_data = manipulated_emp.leftOuterJoin(manipulated_dept)

// Right outer join
val right_outer_join_data = manipulated_emp.rightOuterJoin(manipulated_dept)

// Full outer join
val full_outer_join_data = manipulated_emp.fullOuterJoin(manipulated_dept)

// Format the joined data for readability (using map)
val cleaned_joined_data = join_data.map(t => (t._2._1._1, t._2._1._2, t._1, t._2._2._1))
This will give the output as:
// Print the output cleaned_joined_data on the console
scala> cleaned_joined_data.collect()
res13: Array[(Int, String, Int, String)] = Array((3,matt,30,hive), (5,rhonda,30,hive), (2,ricky,20,spark), (1,jordan,10,hadoop))
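The outer join variants above wrap the unmatched side in an Option, so formatting them needs one extra step. Here is a minimal sketch of cleaning up left_outer_join_data in the same style; the "no-dept" placeholder is my own choice, not part of the original answer:

// leftOuterJoin yields (key, (leftValue, Option[rightValue])); unmatched rows carry None
val cleaned_left_outer = left_outer_join_data.map {
  case (deptId, ((empId, ename, _), maybeDept)) =>
    (empId, ename, deptId, maybeDept.map(_._1).getOrElse("no-dept"))
}
cleaned_left_outer.collect().foreach(println)

With the sample data above, mince (dept 35) has no matching department, so it comes back with the placeholder instead of a department name.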