When I run a job using Spark, I get the following log output:
[Stage 0:> (0 + 32) / 32]
Here, 32 corresponds to the number of partitions of the RDD that I requested.
However, I don't understand why there are multiple stages and what exactly happens in each stage.
Each stage apparently takes a lot of time. Is it possible to complete the job in fewer stages?
A stage in Spark represents a segment of the DAG computation that can be completed locally, without moving data between partitions. A stage boundary falls on any operation that requires a shuffle of data, which is why you'll see the stage named after that operation in the Spark UI. If you're using Spark 1.4+, you can even see this in the DAG visualization section of the UI.
Notice that the split occurs at reduceByKey, which requires a shuffle to complete the full execution.
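To make this concrete, here is a minimal sketch of a job that produces exactly this kind of stage boundary. The data, key function, and app name are hypothetical stand-ins, not taken from the question; only the partition count of 32 matches it:

import org.apache.spark.sql.SparkSession

object StageDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("StageDemo")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // parallelize and map are narrow transformations: each task works on
    // its own partition, so they are pipelined together into Stage 0
    val pairs = sc.parallelize(1 to 1000, 32) // 32 partitions, as in the question
      .map(n => (n % 10, 1))                  // narrow: no data movement

    // reduceByKey needs all records with the same key on the same
    // partition, so Spark shuffles here -- Stage 0 ends, Stage 1 begins
    val counts = pairs.reduceByKey(_ + _)

    counts.collect().foreach(println)
    spark.stop()
  }
}

Because narrow transformations like map are pipelined into a single stage, the only way to end up with fewer stages is to perform fewer shuffle-inducing (wide) operations; the shuffle that reduceByKey requires cannot itself be avoided.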