
What triggers Jobs in Spark?

Tags: apache-spark

I'm learning how Spark works inside Databricks. I understand how shuffling causes stages within jobs, but I don't understand what causes jobs. I thought the relationship was one job per action, but sometimes many jobs happen per action.

E.g.

val initialDF = spark                                                       
  .read                                                                     
  .parquet("/mnt/training/wikipedia/pagecounts/staging_parquet_en_only_clean/")   

val someDF = initialDF
   .orderBy($"project")

someDF.show

triggers two jobs: one to peek at the schema and one to do the .show.

And the same code with .groupBy instead

val initialDF = spark                                                       
  .read                                                                     
  .parquet("/mnt/training/wikipedia/pagecounts/staging_parquet_en_only_clean/")   

val someDF = initialDF
  .groupBy($"project").sum()

someDF.show

...triggers nine jobs.

Replacing .show with .count, the .groupBy version triggers two jobs, and the .orderBy version triggers three.

Sorry I can't share the data to make this reproducible, but I was hoping to understand, in the abstract, the rules for when jobs are created. Happy to share the results of .explain if that's helpful.

asked by rump roast


2 Answers

Normally it is 1:1, as you state. That is to say, 1 Action results in 1 Job, which has 1..N Stages with M Tasks per Stage, and some Stages may be skipped.

However, some Actions trigger extra Jobs under the hood. E.g. pivot: if you pass only the pivot column as a parameter and not the values to pivot on, then Spark first has to fetch all the distinct values in order to generate the columns, which means performing a collect, i.e. an extra Job.
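For example, a rough sketch of the pivot case (salesDF and the column names here are hypothetical, just for illustration):

import org.apache.spark.sql.DataFrame

// Hypothetical DataFrame with columns: year, country, amount.
// Without explicit pivot values, Spark first runs a distinct + collect on
// "country" to work out which columns to generate -- that is the extra Job.
def pivotWithExtraJob(salesDF: DataFrame): DataFrame =
  salesDF.groupBy("year").pivot("country").sum("amount")

// Supplying the values up front skips that extra collect Job.
def pivotWithoutExtraJob(salesDF: DataFrame): DataFrame =
  salesDF.groupBy("year").pivot("country", Seq("NL", "US", "DE")).sum("amount")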

show is also a special case in which extra Job(s) are generated.
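You can observe this yourself by registering a SparkListener and counting job starts around each action. A minimal sketch (listener events arrive asynchronously, so give each action a moment to finish, or cross-check the numbers against the Spark UI):

import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

// Counts every job the scheduler starts.
val jobCount = new AtomicInteger(0)
spark.sparkContext.addSparkListener(new SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    jobCount.incrementAndGet()
})

jobCount.set(0)
someDF.count()                                  // typically a single job
println(s"jobs after count: ${jobCount.get}")

jobCount.set(0)
someDF.show()                                   // can spawn several jobs (see the other answer on limit)
println(s"jobs after show:  ${jobCount.get}")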

answered by thebluephantom


show without an argument displays the first 20 rows.
When show is triggered on a Dataset, it is converted to a head(20) action, which in turn is converted to a limit(20) action:
show -> head -> limit
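So, roughly speaking, someDF.show boils down to something like this (a simplification, not Spark's exact code path):

someDF
  .limit(20)          // the limit(20) described above
  .collect()          // head is essentially limit + collect; this is where jobs get spawned
  .foreach(println)   // show then formats the collected rows as a table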

About limit
Spark executes limit incrementally until the limit is satisfied.
On the first attempt, it tries to retrieve the required number of rows from one partition.
If that does not satisfy the limit, the second attempt retrieves the required number of rows from 4 partitions (determined by spark.sql.limit.scaleUpFactor, default 4), after which 16 partitions are processed, and so on until either the limit is satisfied or the data is exhausted.

In each of the attempts, a separate job is spawned.
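Because the factor defaults to 4, the attempts scan 1, 4, 16, 64, ... partitions, and each attempt that is needed shows up as its own job in the Spark UI. The factor is an ordinary SQL conf, so you can tweak it for experiments (a sketch; the actual number of jobs still depends on how many rows each partition holds):

// Default is 4; raising it makes each retry scan more partitions at once,
// which tends to mean fewer (but larger) limit jobs.
spark.conf.set("spark.sql.limit.scaleUpFactor", 8)

someDF.show()   // every unsatisfied attempt spawns a separate job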

code reference: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L365

answered by DaRkMaN


