What's the difference among ShuffledRDD, MapPartitionsRDD and ParallelCollectionRDD?

Question

I used two different ways to create a Spark RDD, and the resulting DAG charts in the Spark UI are quite different.

[Screenshot of the two Spark UI DAG charts; image not available]

Can someone explain the differences? In my work, the first one is faster than the second, even though the operations are similar.

Sandeep Purohit · Accepted Answer

In your first DAG you simply create the RDD from a collection. In the second one you repartition the RDD with partitionBy, so the data is shuffled across the cluster. That shuffle is why the second one is slower.

Difference between ShuffledRDD, MapPartitionsRDD and ParallelCollectionRDD:

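ShuffledRDD : A ShuffledRDD is created when data is shuffled across the cluster. Transformations that redistribute records by key or into a new set of partitions (e.g. partitionBy, groupBy, reduceByKey, repartition) produce a ShuffledRDD. A join also shuffles its inputs unless they are already co-partitioned.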

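MapPartitionsRDD : A MapPartitionsRDD is created by transformations that work on each partition independently, without moving data between partitions. mapPartitions is the obvious case, and in Spark's Scala API map, filter and flatMap also return a MapPartitionsRDD.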

ParallelCollectionRDD : A ParallelCollectionRDD is created when you build an RDD from a local collection object, e.g. with sc.parallelize.
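
To make the distinction concrete, here is a minimal plain-Python sketch of the three behaviours. It is a toy model, not Spark code: the function names `parallelize`, `map_partitions` and `partition_by` only echo the Spark operations, partitions are ordinary lists, and a record counts as "moved" when its partition index changes (on a real cluster that would mean a network transfer between executors). Integer keys are assumed so that `hash(key)` is deterministic.

```python
# Toy model in plain Python, NOT the Spark API: each partition is a list.

def parallelize(data, num_slices):
    """Split a local collection into contiguous slices
    (models what a ParallelCollectionRDD represents)."""
    n = len(data)
    return [data[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]

def map_partitions(partitions, func):
    """Apply func to every partition separately; no record changes
    partition (models the narrow dependency of a MapPartitionsRDD)."""
    return [list(func(part)) for part in partitions]

def partition_by(partitions, num_partitions):
    """Send each (key, value) record to the partition chosen by the hash
    of its key (models the shuffle behind a ShuffledRDD). Returns the new
    partitions and the number of records whose partition index changed."""
    shuffled = [[] for _ in range(num_partitions)]
    moved = 0
    for src, part in enumerate(partitions):
        for key, value in part:
            dst = hash(key) % num_partitions
            shuffled[dst].append((key, value))
            if dst != src:
                moved += 1
    return shuffled, moved

pairs = [(k, k * k) for k in range(8)]
parts = parallelize(pairs, 2)          # keys 0-3 in slice 0, keys 4-7 in slice 1
doubled = map_partitions(parts, lambda part: ((k, v * 2) for k, v in part))
shuffled, moved = partition_by(doubled, 2)

print(doubled)    # same layout as parts, values doubled
print(shuffled)   # even keys in partition 0, odd keys in partition 1
print(moved)      # 4 of the 8 records changed partition
```

The first two steps touch each record only where it already lives. The third has to look at every record and relocate half of them, which is the extra work that shows up as a separate stage in the DAG and makes the partitionBy version slower.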

For more detail, see https://github.com/JerryLead/SparkInternals, which explains this more thoroughly.

Tags: apache-spark, rdd, pyspark