When I run a job using Spark, I get the following log output:
[Stage 0:> (0 + 32) / 32]
Here, 32 corresponds to the number of partitions of the RDD that I requested.
However, I don't understand why there are multiple stages and what exactly happens in each stage.
Each stage apparently takes a lot of time. Is it possible to complete the job in fewer stages?
A stage in Spark represents a segment of the DAG computation that can be completed locally, without moving data between partitions. A stage boundary falls on any operation that requires a shuffle of data, which is why you'll see the stage named after that operation in the Spark UI. If you're using Spark 1.4+, you can even see this in the DAG visualization section of the UI.
Notice that the split occurs at reduceByKey, which requires a shuffle to complete the full execution.
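To make this concrete, here is a minimal sketch of a job that produces exactly this kind of stage boundary. The data, key function, and app name are hypothetical stand-ins, not taken from the question; only the partition count of 32 matches it:

import org.apache.spark.sql.SparkSession

object StageDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("StageDemo")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // parallelize and map are narrow transformations: each task works on
    // its own partition, so they are pipelined together into Stage 0
    val pairs = sc.parallelize(1 to 1000, 32) // 32 partitions, as in the question
      .map(n => (n % 10, 1))                  // narrow: no data movement

    // reduceByKey needs all records with the same key on the same
    // partition, so Spark shuffles here -- Stage 0 ends, Stage 1 begins
    val counts = pairs.reduceByKey(_ + _)

    counts.collect().foreach(println)
    spark.stop()
  }
}

Because narrow transformations like map are pipelined into a single stage, the only way to end up with fewer stages is to perform fewer shuffle-inducing (wide) operations; the shuffle that reduceByKey requires cannot itself be avoided.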