When running a Spark job on an AWS cluster, I believe I've changed my code to properly distribute both the data and the work of the algorithm I'm using. But the output looks like this:
[Stage 3:> (0 + 2) / 1000]
[Stage 3:> (1 + 2) / 1000]
[Stage 3:> (2 + 2) / 1000]
[Stage 3:> (3 + 2) / 1000]
[Stage 3:> (4 + 2) / 1000]
[Stage 3:> (5 + 2) / 1000]
[Stage 3:> (6 + 2) / 1000]
[Stage 3:> (7 + 2) / 1000]
[Stage 3:> (8 + 2) / 1000]
[Stage 3:> (9 + 2) / 1000]
[Stage 3:> (10 + 2) / 1000]
[Stage 3:> (11 + 2) / 1000]
[Stage 3:> (12 + 2) / 1000]
[Stage 3:> (13 + 2) / 1000]
[Stage 3:> (14 + 2) / 1000]
[Stage 3:> (15 + 2) / 1000]
[Stage 3:> (16 + 2) / 1000]
Am I correct to interpret the (0 + 2) / 1000 as only one two-core processor working through the 1000 tasks, two at a time? With 5 nodes (10 processors), why wouldn't I see (0 + 10) / 1000?
There are 1000 tasks in total to be completed, and only 2 cores are being used to run them: the progress bar reads (completed + running) / total, so (9 + 2) / 1000 means 9 tasks done and 2 currently running. I'm not sure about your setup (I've never used an AWS cluster), but I'd check spark.cores.max
in your Spark configuration. It specifies the maximum number of cores to use across all your executors. It would also be useful if you could show the contents of the Executors tab in your job's Spark UI.
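For example, a minimal sketch of setting it at submit time (assuming a standalone cluster, where spark.cores.max applies; the value 10 just reflects your 5 nodes x 2 cores, and my_job.py is a placeholder for your application):

spark-submit \
  --conf spark.cores.max=10 \
  my_job.py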
This looks more like the output I wanted:
[Stage 2:=======> (143 + 20) / 1000]
[Stage 2:=========> (188 + 20) / 1000]
[Stage 2:===========> (225 + 20) / 1000]
[Stage 2:==============> (277 + 20) / 1000]
[Stage 2:=================> (326 + 20) / 1000]
[Stage 2:==================> (354 + 20) / 1000]
[Stage 2:=====================> (405 + 20) / 1000]
[Stage 2:========================> (464 + 21) / 1000]
[Stage 2:===========================> (526 + 20) / 1000]
[Stage 2:===============================> (588 + 20) / 1000]
[Stage 2:=================================> (633 + 20) / 1000]
[Stage 2:====================================> (687 + 20) / 1000]
[Stage 2:=======================================> (752 + 20) / 1000]
[Stage 2:===========================================> (824 + 20) / 1000]
In AWS EMR, make sure the --executor-cores option is set correctly: it controls the number of cores per executor (not the number of nodes), and together with --num-executors it determines how many tasks can run in parallel, like this:
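A spark-submit sketch that would yield the 20 concurrent tasks shown above (5 executors x 4 cores; the memory value and my_job.py are illustrative assumptions, not taken from your cluster):

spark-submit \
  --num-executors 5 \
  --executor-cores 4 \
  --executor-memory 4G \
  my_job.py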