 

Concatenating datasets of different RDDs in Apache Spark using Scala

Is there a way to concatenate the datasets of two different RDDs in Spark?

The requirement is: I create two intermediate RDDs in Scala that have the same column names, and I need to combine the results of both RDDs and cache the result so it can be served to a UI. How do I combine the datasets here?

The RDDs are of type spark.sql.SchemaRDD.

asked Dec 10 '14 by Atom


People also ask

How do I join multiple DataFrames in Spark Scala?

To join multiple tables, use an inner join. It is the default join in Spark and the most commonly used: it joins two DataFrames/Datasets on key columns, and rows whose keys don't match are dropped from both datasets.
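
As an illustration, a minimal sketch runnable in the Spark shell (where the implicits for toDF are already in scope); the DataFrame names and the "id" join column are invented for the example:

// Hypothetical DataFrames sharing an "id" key column
val employees = Seq((1, "Ann"), (2, "Bob")).toDF("id", "name")
val salaries  = Seq((1, 5000), (3, 7000)).toDF("id", "salary")

// Inner join is the default: rows without a matching "id" on either side are dropped
val joined = employees.join(salaries, "id")
joined.show()   // only id 1 remains: (1, Ann, 5000)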

How do I merge Datasets in Spark?

Spark provides the union() method in the Dataset class to concatenate or append one Dataset to another. To append or concatenate two Datasets, call union() on the first Dataset and pass the second Dataset as the argument.
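
For example, a small sketch (Spark shell; the case class and values are invented for illustration):

// Two Datasets with the same schema
case class Sale(month: String, amount: Int)
val ds1 = Seq(Sale("Aug", 30), Sale("Sep", 31)).toDS()
val ds2 = Seq(Sale("Oct", 10), Sale("Nov", 12)).toDS()

// union appends ds2 to ds1; duplicates are kept (use distinct() to drop them)
val all = ds1.union(ds2)
all.show()   // 4 rows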

What RDD transformation can be used to combine two RDDs?

For example, pair RDDs have a reduceByKey() method that can aggregate data separately for each key, and a join() method that can merge two RDDs together by grouping elements with the same key.
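
A brief sketch of both operations on hypothetical pair RDDs keyed by month:

// Pair RDDs with the month as the key (invented data)
val spend  = sc.parallelize(Seq(("Aug", 30), ("Aug", 15), ("Sep", 31)))
val budget = sc.parallelize(Seq(("Aug", 50), ("Sep", 40)))

// reduceByKey aggregates per key: ("Aug", 45), ("Sep", 31)
val totals = spend.reduceByKey(_ + _)

// join merges the two RDDs on the key: ("Aug", (45, 50)), ("Sep", (31, 40))
totals.join(budget).collect()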


1 Answer

I think you are looking for RDD.union.

val rddPart1 = ???
val rddPart2 = ???
val rddAll = rddPart1.union(rddPart2)

Example (in the Spark shell):

val rdd1 = sc.parallelize(Seq((1, "Aug", 30), (1, "Sep", 31), (2, "Aug", 15), (2, "Sep", 10)))
val rdd2 = sc.parallelize(Seq((1, "Oct", 10), (1, "Nov", 12), (2, "Oct", 5), (2, "Nov", 15)))
rdd1.union(rdd2).collect

res0: Array[(Int, String, Int)] = Array((1,Aug,30), (1,Sep,31), (2,Aug,15), (2,Sep,10), (1,Oct,10), (1,Nov,12), (2,Oct,5), (2,Nov,15))
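
Since the question mentions SchemaRDDs and caching the combined result for a UI, a possible follow-up sketch at the SchemaRDD level (Spark 1.x; rddPart1/rddPart2 stand for the two intermediate SchemaRDDs, and the table name is just an example):

// unionAll preserves the schema, whereas RDD.union would only give an RDD[Row]
val combined = rddPart1.unionAll(rddPart2)

// cache the combined result and register it so the UI layer can query it repeatedly
combined.cache()
combined.registerTempTable("combined_results")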
answered Oct 08 '22 by maasg