Why does sortBy transformation trigger a Spark job?

2 Answers

sortBy is implemented using sortByKey, which depends on a RangePartitioner (JVM) or a partitioning function (Python). When you call sortBy / sortByKey, the partitioner (or partitioning function) is initialized eagerly and samples the input RDD to compute partition boundaries. The job you see corresponds to this sampling step.

The actual sorting is performed only when you execute an action on the newly created RDD or one of its descendants.
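To make the split between the eager and the lazy part concrete, here is a minimal sketch in plain Python, with no Spark involved. It is not Spark's implementation: the names compute_boundaries, partition_for and sorted_partitions, the sample size, and the way split points are picked are all assumptions chosen for illustration. It only shows why the boundaries have to be known up front (they require a pass over the data) while the sort itself can wait until the result is consumed.

```python
import bisect
import random


def compute_boundaries(partitions, num_output_partitions, sample_size=20, seed=0):
    """Eager step (hypothetical helper): sample keys and pick split points.

    This is the part that has to read the input, which is why it
    shows up as a job before any action is called.
    """
    rng = random.Random(seed)
    sample = []
    for part in partitions:
        # Take at most sample_size keys from each input partition.
        sample.extend(rng.sample(part, min(sample_size, len(part))))
    sample.sort()
    if not sample:
        return []
    # Choose num_output_partitions - 1 roughly evenly spaced split points.
    step = len(sample) / num_output_partitions
    return [
        sample[min(int(step * i), len(sample) - 1)]
        for i in range(1, num_output_partitions)
    ]


def partition_for(key, boundaries):
    """Index of the output partition whose key range contains `key`."""
    return bisect.bisect_left(boundaries, key)


def sorted_partitions(partitions, boundaries):
    """Lazy step: a generator, so nothing below runs until it is iterated."""
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for part in partitions:
        for key in part:
            buckets[partition_for(key, boundaries)].append(key)
    for bucket in buckets:
        yield sorted(bucket)


data = [[5, 3, 8], [1, 9, 2], [7, 4, 6, 0]]
bounds = compute_boundaries(data, 3)     # runs immediately, like the sampling job
pending = sorted_partitions(data, bounds)  # creates the generator, sorts nothing yet
result = list(pending)                   # the "action": shuffling and sorting happen here
```

Concatenating the buckets in order gives a globally sorted sequence, because every key in one bucket is no larger than the boundary that separates it from the next bucket.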

129

answered Sep 30 '22 03:09

zero323

As per Spark documentation only the action triggers a job in Spark, the transformations are lazily evaluated when an action is called on it.

In general you're right, but as you've just experienced, there are a few exceptions, and sortBy is among them (along with zipWithIndex).

As a matter of fact, this behavior was reported in Spark's JIRA and closed with a Won't Fix resolution. See SPARK-1021: sortByKey() launches a cluster job when it shouldn't.

You can see the job running with DAGScheduler logging enabled (and later in the web UI):

scala> sc.parallelize(0 to 8).sortBy(identity)
INFO DAGScheduler: Got job 1 (sortBy at <console>:25) with 8 output partitions
INFO DAGScheduler: Final stage: ResultStage 1 (sortBy at <console>:25)
INFO DAGScheduler: Parents of final stage: List()
INFO DAGScheduler: Missing parents: List()
DEBUG DAGScheduler: submitStage(ResultStage 1)
DEBUG DAGScheduler: missing: List()
INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[4] at sortBy at <console>:25), which has no missing parents
DEBUG DAGScheduler: submitMissingTasks(ResultStage 1)
INFO DAGScheduler: Submitting 8 missing tasks from ResultStage 1 (MapPartitionsRDD[4] at sortBy at <console>:25)
DEBUG DAGScheduler: New pending partitions: Set(0, 1, 5, 2, 6, 3, 7, 4)
INFO DAGScheduler: ResultStage 1 (sortBy at <console>:25) finished in 0.013 s
DEBUG DAGScheduler: After removal of stage 1, remaining stages = 0
INFO DAGScheduler: Job 1 finished: sortBy at <console>:25, took 0.019755 s
res1: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[6] at sortBy at <console>:25
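The other exception mentioned above, zipWithIndex, follows the same pattern: before a global index can be assigned to an element, the number of elements in the partitions that come before it has to be known, and that requires a pass over the data (the Spark API documentation notes that zipWithIndex triggers a job when the RDD has more than one partition). Below is a minimal sketch of the idea in plain Python, without Spark. The names start_offsets and zip_with_index are hypothetical, and the code is an illustration of the principle rather than Spark's implementation.

```python
from itertools import accumulate


def start_offsets(partitions):
    """Eager step (hypothetical helper): count elements to get each partition's first index."""
    counts = [len(part) for part in partitions]
    # Partition i starts where partitions 0..i-1 end; the grand total is not needed.
    return [0] + list(accumulate(counts))[:-1] if counts else []


def zip_with_index(partitions, offsets):
    """Lazy step: a generator pairing each element with its global index."""
    for part, start in zip(partitions, offsets):
        for i, item in enumerate(part):
            yield item, start + i


parts = [["a", "b"], ["c"], ["d", "e", "f"]]
offsets = start_offsets(parts)           # runs immediately, like the counting job
indexed = zip_with_index(parts, offsets)  # nothing is paired until this is iterated
```

With a single partition no counting is needed, since its starting offset is always 0.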

answered Sep 30 '22 02:09

Jacek Laskowski



Tags:

apache-spark

rdd

partitioning

partitioner

Prabu Soundar Rajan
