What does stage mean in the Spark logs?

When I run a job using Spark, I get the following log line:

[Stage 0:> (0 + 32) / 32]

Here 32 corresponds to the number of partitions of the RDD that I asked for.
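(Just for illustration, not my exact code: asking for 32 partitions looks something like sc.textFile("hdfs:///data/input", 32), and those 32 partitions show up as the 32 tasks of the stage.)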

However, I don't understand why there are multiple stages and what exactly happens in each stage.

Each stage apparently takes a lot of time. Is it possible to get the job done in fewer stages?

asked Oct 07 '15 by Harit Vishwakarma


1 Answer

A stage in Spark represents a segment of the DAG computation that can be executed without moving data between partitions. The DAG is split into a new stage at any operation that requires a shuffle of data, which is why you'll see the stage named after that operation in the Spark UI. If you're using Spark 1.4+, you can even see this in the UI's DAG visualization section:

[Spark UI DAG visualization showing the stage boundary at the shuffle]

Notice that the split occurs at reduceByKey, which requires a shuffle to complete the full execution.
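As a concrete sketch (a generic word count, not the asker's actual job): everything before reduceByKey runs independently inside each partition and forms Stage 0, while the shuffle triggered by reduceByKey starts Stage 1.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("stage-demo"))

    // Stage 0: narrow transformations, executed independently in each of the 32 partitions
    val pairs = sc.textFile("input.txt", 32)      // 32 partitions -> 32 tasks per stage
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))

    // reduceByKey needs all values for a given key on the same executor,
    // so Spark shuffles the data here -- this boundary starts Stage 1
    val counts = pairs.reduceByKey(_ + _)

    counts.saveAsTextFile("output")               // the action that actually triggers both stages

You can also print counts.toDebugString: each extra level of indentation in that output marks a shuffle boundary, i.e. a separate stage.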

answered Sep 30 '22 by Justin Pihony