
What do the numbers on the progress bar mean in spark-shell?

Tags:

apache-spark

In my spark-shell, what do entries like the below mean when I execute a function ?

[Stage 7:===========>                              (14174 + 5) / 62500]
rmckeown asked May 14 '15 18:05


People also ask

What do stages mean in Pyspark?

A stage is a set of independent tasks all computing the same function that need to run as part of a Spark job, where all the tasks have the same shuffle dependencies.
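
The same structure applies in the Scala shell as in PySpark. As an illustration (not from the original page), a wide dependency such as reduceByKey forces a shuffle, so the sketch below runs as two stages:

// Hypothetical spark-shell example: the shuffle introduced by reduceByKey
// splits this job into two stages.
val counts = sc.parallelize(1 to 1000000, numSlices = 100) // stage 0: 100 map tasks
  .map(x => (x % 10, 1))
  .reduceByKey(_ + _)                                       // shuffle => stage boundary
  .collect()                                                // stage 1: reduce tasks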

What is the Spark UI?

The Spark UI enables you to check the following for each job: The event timeline of each Spark stage. A directed acyclic graph (DAG) of the job. Physical and logical plans for SparkSQL queries. The underlying Spark environmental variables for each job.
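
As a side note (not from the original page), in Spark 2.0 and later you can get the UI's address straight from the SparkContext in spark-shell:

// Prints the URL of the Spark UI for the running application, e.g. http://host:4040
println(sc.uiWebUrl.getOrElse("Spark UI disabled (spark.ui.enabled=false)"))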


2 Answers

What you get is a Console Progress Bar: [Stage 7: shows the stage you are in now, and (14174 + 5) / 62500] is (numCompletedTasks + numActiveTasks) / totalNumOfTasksInThisStage. The bar itself shows numCompletedTasks / totalNumOfTasksInThisStage.

It is shown when both spark.ui.showConsoleProgress is true (the default) and the log level in conf/log4j.properties is ERROR or WARN (i.e. !log.isInfoEnabled is true).
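
As a sketch of how you could satisfy both conditions from a standalone program (the property and API names are standard, but this exact setup is illustrative and not from the answer):

import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}

// The bar is only created if the log level is above INFO when the SparkContext
// starts, so raise it beforehand (or in conf/log4j.properties as described above).
Logger.getRootLogger.setLevel(Level.WARN)

val conf = new SparkConf()
  .setAppName("progress-bar-demo")
  .setMaster("local[4]")
  .set("spark.ui.showConsoleProgress", "true") // already the default
val sc = new SparkContext(conf)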

Let's look at the code in ConsoleProgressBar.scala that renders it:

private def show(now: Long, stages: Seq[SparkStageInfo]) {
  val width = TerminalWidth / stages.size
  val bar = stages.map { s =>
    val total = s.numTasks()
    val header = s"[Stage ${s.stageId()}:"
    val tailer = s"(${s.numCompletedTasks()} + ${s.numActiveTasks()}) / $total]"
    val w = width - header.length - tailer.length
    val bar = if (w > 0) {
      val percent = w * s.numCompletedTasks() / total
      (0 until w).map { i =>
        if (i < percent) "=" else if (i == percent) ">" else " "
      }.mkString("")
    } else {
      ""
    }
    header + bar + tailer
  }.mkString("")

  // only refresh if it's changed or after 1 minute (or the ssh connection will be closed
  // after being idle for some time)
  if (bar != lastProgressBar || now - lastUpdateTime > 60 * 1000L) {
    System.err.print(CR + bar)
    lastUpdateTime = now
  }
  lastProgressBar = bar
}
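
To tie this back to the numbers in the question, here is a small spark-shell sketch (not from the answer; the 80-column terminal width is an assumption) that rebuilds the line with the same arithmetic:

// Standalone sketch: rebuild the progress line for the numbers in the question,
// assuming a hypothetical terminal width of 80 columns.
val total = 62500          // C: total tasks in the stage
val completed = 14174      // A: finished tasks
val active = 5             // B: running tasks
val header = "[Stage 7:"
val tailer = s"($completed + $active) / $total]"
val w = 80 - header.length - tailer.length   // columns left for the bar itself
val percent = w * completed / total          // integer division: how many '=' columns to fill
val bar = (0 until w).map { i =>
  if (i < percent) "=" else if (i == percent) ">" else " "
}.mkString
println(header + bar + tailer)

With these values about 23% of the tasks have finished, so only the leftmost part of the bar is filled with = signs.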
yjshen answered Sep 30 '22 22:09


Let's assume you see the following (X, A, B, C are always non-negative integers):

[Stage X:==========>            (A + B) / C] 

(for example in the question X=7, A=14174, B=5 and C=62500)

Here is what is going on at a high level: Spark breaks the work into stages, and each stage into tasks. This progress indicator means that Stage X consists of C tasks. During execution, A and B start at zero and keep changing: A is always the number of tasks already finished, and B is the number of tasks currently executing. For a stage with many tasks (far more than the workers you have), you should expect to see B grow to a number that corresponds to how many workers are in the cluster, and then A start to increase as tasks complete. Towards the end, as the last few tasks execute, B starts decreasing until it reaches 0, at which point A equals C, the stage is done, and Spark moves on to the next stage. C stays constant the whole time; remember, it is the total number of tasks in the stage and never changes.

The ====> shows the percentage of work done, based on what I described above. At the beginning the > is towards the left, and it moves to the right as tasks complete.
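
If you want to watch A, B and C change yourself rather than read them off the bar, a listener can surface the same counters. The following is a sketch, not part of the original answer: the callbacks and sc.addSparkListener are standard Spark APIs, but the ProgressCounters class is hypothetical.

import org.apache.spark.scheduler.{SparkListener, SparkListenerStageSubmitted, SparkListenerTaskEnd, SparkListenerTaskStart}

// Tracks the same three counters the console bar prints:
//   completed (A), running (B) and total (C).
class ProgressCounters extends SparkListener {
  @volatile var total = 0        // C: fixed when the stage is submitted
  @volatile var running = 0      // B: climbs to roughly the number of cores, then falls to 0
  @volatile var completed = 0    // A: climbs until it equals C

  override def onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted): Unit =
    total = stageSubmitted.stageInfo.numTasks

  override def onTaskStart(taskStart: SparkListenerTaskStart): Unit =
    synchronized { running += 1 }

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
    synchronized { running -= 1; completed += 1 }
}

// In spark-shell: sc.addSparkListener(new ProgressCounters)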

gae123 answered Sep 30 '22 23:09