What does "Stage Skipped" mean in Apache Spark web UI?

This is from my Spark UI. What does "skipped" mean here?

[Screenshot: Spark UI DAG visualization]

Aravind Yarram asked Jan 03 '16 19:01


1 Answer

Typically it means that the data was fetched from cache, so there was no need to re-execute the given stage. This is consistent with your DAG, which shows that the next stage requires a shuffle (reduceByKey). Whenever a shuffle is involved, Spark automatically caches the generated data:

Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files are preserved until the corresponding RDDs are no longer used and are garbage collected. This is done so the shuffle files don’t need to be re-created if the lineage is re-computed.
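
To make the mechanism concrete, here is a minimal toy model in plain Python. It is not Spark's scheduler and uses no Spark API; the names `Stage`, `ToyScheduler`, and `run_job` are hypothetical and exist only for this sketch. It assumes a simple linear chain of stages and shows the idea that a stage is skipped when its shuffle output from an earlier job is still available.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class Stage:
    """Hypothetical stage: `parent` is the stage whose shuffle output it reads."""
    name: str
    parent: Optional["Stage"] = None


class ToyScheduler:
    """Toy stand-in for a DAG scheduler (an assumption, not Spark's code)."""

    def __init__(self) -> None:
        # Names of stages whose shuffle files are still on disk.
        self.shuffle_outputs: Set[str] = set()

    def run_job(self, final_stage: Stage) -> Dict[str, str]:
        # The final (result) stage always runs.
        report = {final_stage.name: "ran"}
        stage = final_stage.parent
        needed = True
        while stage is not None:
            if needed and stage.name not in self.shuffle_outputs:
                # No reusable output: run the stage and keep its shuffle files.
                self.shuffle_outputs.add(stage.name)
                report[stage.name] = "ran"
            else:
                # Output already exists, so this stage and everything
                # upstream of it does not need to be recomputed.
                report[stage.name] = "skipped"
                needed = False
            stage = stage.parent
        return report


map_stage = Stage("map")                      # stage before the shuffle
result_stage = Stage("result", parent=map_stage)

scheduler = ToyScheduler()
first = scheduler.run_job(result_stage)       # first action: everything runs
second = scheduler.run_job(result_stage)      # second action: map is reused
print(first)
print(second)
```

In real Spark the analogous experiment is to run two actions on the same RDD produced by reduceByKey: the second job should show the stage before the shuffle as skipped in the UI.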

zero323 answered Oct 13 '22 09:10