How does Spark achieve sort order?

Assume I have a list of Strings. I filter and sort them, and collect the result to the driver. However, things are distributed, and each partition of the RDD holds its own part of the original list. So how does Spark achieve the final sorted order? Does it merge the results?

asked Oct 01 '15 by dveim

People also ask

How does PySpark sort data?

You can use either the sort() or orderBy() method of a PySpark DataFrame to sort it in ascending or descending order by one or more columns; PySpark SQL's sorting functions can be used as well.
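For illustration, here is a minimal sketch in the Scala Dataset API, which the PySpark sort()/orderBy() calls mirror one-to-one (the example data and the SparkSession named `spark` are assumptions):

import org.apache.spark.sql.functions.{asc, desc}

// hypothetical example data
val df = spark.createDataFrame(Seq(("b", 2), ("a", 3), ("c", 1)))
  .toDF("key", "value")

df.sort(asc("key")).show()                  // ascending by a single column
df.orderBy(desc("value")).show()            // descending; orderBy is an alias of sort
df.sort(asc("key"), desc("value")).show()   // multiple columns, mixed directions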

How does Spark sort RDD?

From the documentation of sortByKey([ascending], [numTasks]): when called on a dataset of (K, V) pairs where K implements Ordered, it returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified by the boolean ascending argument.
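A minimal sketch, assuming a SparkContext named `sc` and made-up example pairs:

val pairs = sc.parallelize(Seq(("b", 2), ("a", 3), ("c", 1)))

pairs.sortByKey().collect()
// Array((a,3), (b,2), (c,1))

pairs.sortByKey(ascending = false, numPartitions = 2).collect()
// Array((c,1), (b,2), (a,3))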

What is the difference between sort and orderBy in Spark?

In Spark's DataFrame API, orderBy() is simply an alias for sort(): both produce a total order across all partitions and therefore require a shuffle. The cheaper variant that sorts each partition individually, without a shuffle and without guaranteeing a global order, is sortWithinPartitions().
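The difference is easiest to see side by side; a sketch assuming a DataFrame `df` with a "key" column:

df.orderBy("key")               // total order across partitions (requires a shuffle)
df.sort("key")                  // identical: orderBy is an alias for sort
df.sortWithinPartitions("key")  // sorts each partition locally; no shuffle, no global order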

What algorithm does Spark use?

TimSort: In Apache Spark 1.1, the default sorting algorithm was switched from quicksort to TimSort, a hybrid of merge sort and insertion sort. It performs better than quicksort on most real-world datasets, especially ones that are partially ordered, and Spark uses it in both the map and reduce phases.
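Spark's internal TimSort is not public API, but the JDK's Arrays.sort for object arrays is also a TimSort, so the algorithm's behavior on partially ordered input can be observed directly:

// Boxed so that Arrays.sort takes the object (TimSort) path,
// not the primitive dual-pivot quicksort path.
val partiallyOrdered: Array[Integer] =
  Array(1, 2, 3, 4, 5, 9, 8, 7, 6, 10).map(Int.box)

java.util.Arrays.sort(partiallyOrdered)  // TimSort exploits the pre-sorted runs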


1 Answer

Sorting in Spark is a multi-phase process that requires a shuffle:

  1. the input RDD is sampled, and the sample is used to compute the boundaries of each output partition (sample followed by collect)
  2. the input RDD is partitioned with a RangePartitioner using the boundaries computed in the first step (partitionBy)
  3. each partition from the second step is sorted locally (mapPartitions)

When the data is collected, all that is left is to follow the order defined by the partitioner.
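These phases can be reproduced by hand on a plain RDD. The following is a minimal sketch, assuming a SparkContext named `sc`; it is not Spark's actual sortBy implementation, but it mirrors its structure:

import org.apache.spark.RangePartitioner

// key the values by themselves so the pair-RDD machinery applies
val rdd = sc.parallelize(Seq(4, 2, 5, 3, 1)).map(x => (x, x))

// 1. sampling happens inside RangePartitioner to pick the boundaries
val partitioner = new RangePartitioner(2, rdd)

// 2. shuffle so that partition i only holds keys below those of partition i+1
val ranged = rdd.partitionBy(partitioner)

// 3. sort each partition locally; concatenating the partitions in order
//    now yields a total order
val sorted = ranged.mapPartitions(
  iter => iter.toSeq.sortBy(_._1).iterator,
  preservesPartitioning = true)

sorted.keys.collect()  // Array(1, 2, 3, 4, 5)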

The steps above are clearly reflected in the debug string:

scala> val rdd = sc.parallelize(Seq(4, 2, 5, 3, 1))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at ...

scala> rdd.sortBy(identity).toDebugString
res1: String =
(6) MapPartitionsRDD[10] at sortBy at <console>:24 []     // Sort partitions
 |  ShuffledRDD[9] at sortBy at <console>:24 []           // Shuffle
 +-(8) MapPartitionsRDD[6] at sortBy at <console>:24 []   // Pre-shuffle steps
    |  ParallelCollectionRDD[0] at parallelize at <console>:21 [] // Parallelize
answered Sep 21 '22 by zero323