
What's the difference among ShuffledRDD, MapPartitionsRDD and ParallelCollectionRDD?

I used two different ways to generate a Spark RDD, and the resulting DAG charts in the Spark UI are quite different.

[Image: Spark UI DAG chart for the first approach]

[Image: Spark UI DAG chart for the second approach]

Can someone explain the differences? In my work, the first one is faster than the second one for a similar operation.

American curl asked Oct 13 '16 05:10


1 Answer

In your first DAG you are simply creating the RDD from a collection. In the second, you repartition the RDD with partitionBy, so the data is shuffled across the cluster. That shuffle is why the second job is slower.

Difference between ShuffledRDD, MapPartitionsRDD and ParallelCollectionRDD:

ShuffledRDD: created when data is shuffled across the cluster. Transformations that shuffle your data (e.g. groupBy, reduceByKey, partitionBy, repartition) introduce a ShuffledRDD into the lineage.

MapPartitionsRDD: created by transformations that process each partition independently, without moving data between partitions. This includes mapPartitions as well as map, filter and flatMap.

ParallelCollectionRDD: created when you build an RDD from a local collection object, for example with sc.parallelize.
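To make the distinction concrete, here is a toy model in plain Python. It is not Spark code and does not use Spark's API: the function names `parallelize`, `map_partitions` and `partition_by` are hypothetical stand-ins, partitions are ordinary lists, and `zlib.crc32` is an assumed substitute for whatever key hashing the real partitioner uses. It only sketches the idea that a per-partition step leaves records where they are, while a hash repartition may send each record to a different partition, which on a real cluster means network and disk traffic.

```python
import zlib


def parallelize(records, num_slices):
    """Split a local list into contiguous slices, one list per partition."""
    n = len(records)
    return [records[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]


def map_partitions(partitions, fn):
    """Narrow step: each output partition is built from exactly one input partition."""
    return [[fn(rec) for rec in part] for part in partitions]


def partition_by(partitions, num_partitions):
    """Wide step: route every (key, value) record by a hash of its key.

    Returns the new partitions and the number of records whose
    partition index changed (a rough stand-in for shuffle traffic).
    """
    out = [[] for _ in range(num_partitions)]
    moved = 0
    for src, part in enumerate(partitions):
        for key, value in part:
            # crc32 is deterministic across runs, unlike Python's built-in str hash.
            dst = zlib.crc32(key.encode("utf-8")) % num_partitions
            out[dst].append((key, value))
            if dst != src:
                moved += 1
    return out, moved


records = [(f"user{i}", i) for i in range(8)]
base = parallelize(records, 2)                                  # like creating an RDD from a collection
doubled = map_partitions(base, lambda kv: (kv[0], kv[1] * 2))   # no data movement
shuffled, moved = partition_by(doubled, 2)                      # records may cross partitions
print("records that changed partition:", moved)
```

In this model the first two steps never look outside a single partition, so they can run back to back in one stage. The last step has to read every input partition before any output partition is complete, which corresponds to the stage boundary that the shuffle adds to the DAG chart.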

If you want more detail, this repository explains Spark's internals in depth: https://github.com/JerryLead/SparkInternals

Sandeep Purohit answered Oct 05 '22 23:10