Concatenating datasets of different RDDs in Apache Spark using Scala

Is there a way to concatenate the datasets of two different RDDs in Spark?

The requirement: I create two intermediate RDDs in Scala that have the same column names. I need to combine the results of both RDDs and cache the combined result so the UI can access it. How do I combine the datasets?

Both RDDs are of type spark.sql.SchemaRDD.

asked Dec 10 '14 07:12

Atom

1 Answer

I think you are looking for RDD.union:

val rddPart1 = ???
val rddPart2 = ???
val rddAll = rddPart1.union(rddPart2)

Example (in spark-shell):

val rdd1 = sc.parallelize(Seq((1, "Aug", 30), (1, "Sep", 31), (2, "Aug", 15), (2, "Sep", 10)))
val rdd2 = sc.parallelize(Seq((1, "Oct", 10), (1, "Nov", 12), (2, "Oct", 5), (2, "Nov", 15)))
rdd1.union(rdd2).collect

res0: Array[(Int, String, Int)] = Array((1,Aug,30), (1,Sep,31), (2,Aug,15), (2,Sep,10), (1,Oct,10), (1,Nov,12), (2,Oct,5), (2,Nov,15))
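Since the question also asks about caching the combined result, here is a minimal sketch of union followed by cache as a standalone program. The object name UnionAndCache, the helper combine, the Row type alias, and the local[*] master are all assumptions made for illustration; they are not part of the original question or of Spark itself. It uses plain tuple RDDs rather than SchemaRDD to stay independent of the Spark SQL version. Note that union keeps duplicate rows, so distinct is needed if a row can appear in both inputs.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object UnionAndCache {
  // Hypothetical row shape, matching the tuples used in the example above.
  type Row = (Int, String, Int)

  // Hypothetical helper: concatenate two RDDs and mark the result for caching.
  // cache() is lazy; the data is materialized by the first action.
  def combine(a: RDD[Row], b: RDD[Row]): RDD[Row] = a.union(b).cache()

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, for trying this outside spark-shell.
    val conf = new SparkConf().setAppName("union-example").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val rdd1 = sc.parallelize(Seq((1, "Aug", 30), (1, "Sep", 31)))
      val rdd2 = sc.parallelize(Seq((1, "Sep", 31), (1, "Oct", 10)))

      val all = combine(rdd1, rdd2)

      // union does not remove duplicates: (1, "Sep", 31) is counted twice.
      println(all.count())            // 4
      println(all.distinct().count()) // 3
    } finally {
      sc.stop()
    }
  }
}
```

Later actions on the cached RDD reuse the stored partitions instead of recomputing both inputs, which is usually what you want when a UI queries the same combined data repeatedly.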

answered Oct 08 '22 05:10

maasg

Tags:

scala

distributed-computing

apache-spark

rdd

apache-spark-sql
