Is there a way to concatenate the datasets of two different RDDs in Spark?
The requirement: I create two intermediate RDDs in Scala that have the same column names. I need to combine the results of both RDDs and cache the combined result so the UI can access it. How do I combine the datasets here?
The RDDs are of type spark.sql.SchemaRDD.
To explain joins across multiple tables, we will use an inner join. This is the default join in Spark and the most commonly used one: it joins two DataFrames/Datasets on key columns and drops rows from both datasets where the keys don't match.
Append or Concatenate Datasets
Spark provides the union() method in the Dataset class to concatenate or append one Dataset to another. To append or concatenate two Datasets, call Dataset.union() on the first Dataset and pass the second Dataset as the argument.
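As a rough illustration of the Dataset.union() call described above, here is a minimal sketch. The MonthlyCount case class, the object name, and the local[*] master are assumptions made for the example, not part of the original question; the sample values are borrowed from the spark-shell example in this thread.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type, used only for this illustration.
case class MonthlyCount(id: Int, month: String, count: Int)

object DatasetUnionSketch {
  // Appends `second` to `first`. union() keeps duplicate rows and
  // matches columns by position, not by name.
  def combine(first: Dataset[MonthlyCount],
              second: Dataset[MonthlyCount]): Dataset[MonthlyCount] =
    first.union(second)

  def main(args: Array[String]): Unit = {
    // Local session for experimentation; a real job would be configured differently.
    val spark = SparkSession.builder()
      .appName("dataset-union-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val first  = Seq(MonthlyCount(1, "Aug", 30), MonthlyCount(2, "Aug", 15)).toDS()
    val second = Seq(MonthlyCount(1, "Oct", 10), MonthlyCount(2, "Oct", 5)).toDS()

    // Four rows in total: two from each input Dataset.
    combine(first, second).show()

    spark.stop()
  }
}
```

Because union() resolves columns by position, two Datasets with the same column names in a different order would be combined incorrectly; in Spark 2.3 and later, unionByName() matches columns by name instead.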
For example, pair RDDs have a reduceByKey() method that can aggregate data separately for each key, and a join() method that can merge two RDDs together by grouping elements with the same key.
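To make the difference between these key-based operations and a plain concatenation concrete, here is a small sketch of reduceByKey() and join() on pair RDDs. The object name, the helper names, and the label lookup data are hypothetical; the counts reuse values from the example in this thread.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PairRddSketch {
  // Sums the values separately for each key.
  def totalsByKey(counts: RDD[(Int, Int)]): RDD[(Int, Int)] =
    counts.reduceByKey(_ + _)

  // Inner join: keeps only the keys present in both RDDs.
  def withLabels(totals: RDD[(Int, Int)],
                 labels: RDD[(Int, String)]): RDD[(Int, (Int, String))] =
    totals.join(labels)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("pair-rdd-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // (id, count) pairs
    val counts = sc.parallelize(Seq((1, 30), (1, 31), (2, 15), (2, 10)))
    // Hypothetical lookup data: id -> label
    val labels = sc.parallelize(Seq((1, "store-a"), (2, "store-b")))

    val totals = totalsByKey(counts)        // (1,61) and (2,25)
    val joined = withLabels(totals, labels) // (1,(61,store-a)) and (2,(25,store-b))

    // Sort on the driver so the printed order is stable.
    joined.collect().sortBy(_._1).foreach(println)

    sc.stop()
  }
}
```

Unlike union(), both operations work on keys: reduceByKey() collapses rows that share a key, and join() drops keys that appear in only one of the RDDs.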
I think you are looking for RDD.union
val rddPart1 = ???
val rddPart2 = ???
val rddAll = rddPart1.union(rddPart2)
Example (on Spark-shell)
val rdd1 = sc.parallelize(Seq((1, "Aug", 30), (1, "Sep", 31), (2, "Aug", 15), (2, "Sep", 10)))
val rdd2 = sc.parallelize(Seq((1, "Oct", 10), (1, "Nov", 12), (2, "Oct", 5), (2, "Nov", 15)))
rdd1.union(rdd2).collect
res0: Array[(Int, String, Int)] = Array((1,Aug,30), (1,Sep,31), (2,Aug,15), (2,Sep,10), (1,Oct,10), (1,Nov,12), (2,Oct,5), (2,Nov,15))
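Since the question also asks about caching the combined result, the following sketch shows one way that could look. The object and helper names are hypothetical, and the local[*] master is only for trying this outside spark-shell (where sc already exists).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CachedUnionSketch {
  // Concatenates two RDDs of the same element type and marks the result
  // for caching. union() does not remove duplicates.
  def unionAndCache[T](first: RDD[T], second: RDD[T]): RDD[T] =
    first.union(second).cache()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("cached-union-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val rdd1 = sc.parallelize(Seq((1, "Aug", 30), (1, "Sep", 31)))
    val rdd2 = sc.parallelize(Seq((1, "Oct", 10), (1, "Nov", 12)))

    val all = unionAndCache(rdd1, rdd2)

    // cache() is lazy: the data is materialized by the first action
    // and reused by later ones.
    println(all.count())
    all.collect().foreach(println)

    sc.stop()
  }
}
```

If the two inputs can overlap and duplicates are not wanted, calling distinct() on the union before caching removes them, at the cost of a shuffle.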