
What triggers Jobs in Spark?

Tags: apache-spark

I'm learning how Spark works inside Databricks. I understand how shuffling causes stages within jobs, but I don't understand what causes jobs. I thought the relationship was one job per action, but sometimes many jobs happen per action.

E.g.

val initialDF = spark                                                       
  .read                                                                     
  .parquet("/mnt/training/wikipedia/pagecounts/staging_parquet_en_only_clean/")   

val someDF = initialDF
   .orderBy($"project")

someDF.show

triggers two jobs: one to peek at the schema and one to do the .show.

And the same code with .groupBy instead

val initialDF = spark                                                       
  .read                                                                     
  .parquet("/mnt/training/wikipedia/pagecounts/staging_parquet_en_only_clean/")   

val someDF = initialDF
  .groupBy($"project").sum()

someDF.show

...triggers nine jobs.

Replacing .show with .count, the .groupBy version triggers two jobs, and the .orderBy version triggers three.

Sorry I can't share the data to make this reproducible, but I was hoping to understand, in the abstract, the rules for when jobs are created. Happy to share the results of .explain if that's helpful.

asked by rump roast


2 Answers

Normally it is 1:1, as you state. That is to say, 1 Action results in 1 Job, which has 1..N Stages with M Tasks per Stage, and some Stages may be skipped.

However, some Actions trigger extra Jobs under the hood. E.g. pivot: if you pass only the pivot column as a parameter and not the values to pivot on, then Spark first has to fetch all the distinct values in order to generate the columns, which means performing a collect, i.e. an extra Job.
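For example, a rough sketch of the pivot case (salesDF and the column names here are hypothetical, just for illustration):

import org.apache.spark.sql.DataFrame

// Hypothetical DataFrame with columns: year, country, amount.
// Without explicit pivot values, Spark first runs a distinct + collect on
// "country" to work out which columns to generate -- that is the extra Job.
def pivotWithExtraJob(salesDF: DataFrame): DataFrame =
  salesDF.groupBy("year").pivot("country").sum("amount")

// Supplying the values up front skips that extra collect Job.
def pivotWithoutExtraJob(salesDF: DataFrame): DataFrame =
  salesDF.groupBy("year").pivot("country", Seq("NL", "US", "DE")).sum("amount")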

show is also a special case in which extra Job(s) are generated.
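You can observe this yourself by registering a SparkListener and counting job starts around each action. A minimal sketch (listener events arrive asynchronously, so give each action a moment to finish, or cross-check the numbers against the Spark UI):

import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

// Counts every job the scheduler starts.
val jobCount = new AtomicInteger(0)
spark.sparkContext.addSparkListener(new SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    jobCount.incrementAndGet()
})

jobCount.set(0)
someDF.count()                                  // typically a single job
println(s"jobs after count: ${jobCount.get}")

jobCount.set(0)
someDF.show()                                   // can spawn several jobs (see the other answer on limit)
println(s"jobs after show:  ${jobCount.get}")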

answered by thebluephantom


show without an argument displays the first 20 rows.
When show is triggered on a Dataset, it is converted to a head(20) action, which in turn is converted to a limit(20) action:
show -> head -> limit
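So, roughly speaking, someDF.show boils down to something like this (a simplification, not Spark's exact code path):

someDF
  .limit(20)          // the limit(20) described above
  .collect()          // head is essentially limit + collect; this is where jobs get spawned
  .foreach(println)   // show then formats the collected rows as a table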

About limit
Spark executes limit incrementally until the limit is satisfied.
On the first attempt, it tries to retrieve the required number of rows from one partition.
If that does not satisfy the limit, the second attempt retrieves the required number of rows from 4 partitions (determined by spark.sql.limit.scaleUpFactor, default 4), after which 16 partitions are processed, and so on until either the limit is satisfied or the data is exhausted.

In each of the attempts, a separate job is spawned.
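Because the factor defaults to 4, the attempts scan 1, 4, 16, 64, ... partitions, and each attempt that is needed shows up as its own job in the Spark UI. The factor is an ordinary SQL conf, so you can tweak it for experiments (a sketch; the actual number of jobs still depends on how many rows each partition holds):

// Default is 4; raising it makes each retry scan more partitions at once,
// which tends to mean fewer (but larger) limit jobs.
spark.conf.set("spark.sql.limit.scaleUpFactor", 8)

someDF.show()   // every unsatisfied attempt spawns a separate job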

code reference: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L365

answered by DaRkMaN


