I used two different ways to generate a Spark RDD, and the resulting DAG charts in the Spark UI are quite different.
Can someone explain the differences? In my work, the first one is faster than the second, even though the operations are similar.
In your first DAG you are simply creating the RDD from a collection. In the second one you call partitionBy, which shuffles the data across the cluster. That shuffle adds a stage boundary and moves records between partitions, which is why the second version is slower.
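To make the cost of the shuffle concrete, here is a minimal plain-Python model of it. This is not Spark code: the helper names `parallelize` and `partition_by` are hypothetical stand-ins chosen to mirror the Spark calls, and the hashing is a simplified `hash(key) % num_partitions`. It only illustrates that a collection sliced by position has to be re-bucketed by key, so most records end up in a different partition.

```python
# Plain-Python model (no Spark required) of why partitionBy forces a shuffle.
# The function names are hypothetical stand-ins for the Spark calls.

def parallelize(records, num_partitions):
    """Slice a local collection into contiguous chunks by position."""
    size = len(records)
    return [
        records[i * size // num_partitions:(i + 1) * size // num_partitions]
        for i in range(num_partitions)
    ]


def partition_by(partitions, num_partitions):
    """Re-bucket (key, value) records by hash(key) % num_partitions.

    Also counts how many records leave the partition they started in.
    """
    shuffled = [[] for _ in range(num_partitions)]
    moved = 0
    for source, part in enumerate(partitions):
        for key, value in part:
            target = hash(key) % num_partitions
            shuffled[target].append((key, value))
            if target != source:
                moved += 1
    return shuffled, moved


records = [(k, k * k) for k in range(8)]
before = parallelize(records, 4)         # keys grouped by position
after, moved = partition_by(before, 4)   # keys grouped by hash

print([[k for k, _ in p] for p in before])
print([[k for k, _ in p] for p in after])
print(moved, "of", len(records), "records changed partition")
```

In a real cluster each record that changes partition may have to be serialized, written to disk, and sent over the network, which is where the extra time goes.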
Difference between ShuffledRDD, MapPartitionsRDD and ParallelCollectionRDD:
ShuffledRDD : ShuffledRDD is created when the data is shuffled across the cluster. Transformations that redistribute data by key (e.g. groupBy, partitionBy, repartition, etc.) create a ShuffledRDD.
MapPartitionsRDD : MapPartitionsRDD is created by per-partition transformations that need no shuffle, such as map, filter, flatMap and mapPartitions.
ParallelCollectionRDD : ParallelCollectionRDD is created when you create the RDD from a local collection object (for example with sc.parallelize).
If you want more detail, please see https://github.com/JerryLead/SparkInternals, which explains this more thoroughly.