
How to get the progress bar (with stages and tasks) with yarn-cluster master?

When running a Spark Shell query using something like this:

spark-shell yarn --name myQuery -i ./my-query.scala

My query is a simple Spark SQL script that reads parquet files, runs a few simple queries, and writes out parquet files. When running these queries I get a nice progress bar like this:

[Stage7:===========>                              (14174 + 5) / 62500]

When I create a jar using the exact same query and run it with the following command-line:

spark-submit \
  --master yarn-cluster \
  --driver-memory 16G \
  --queue default \
  --num-executors 5 \
  --executor-cores 4 \
  --executor-memory 32G \
  --name MyQuery \
  --class com.data.MyQuery \
  target/uber-my-query-0.1-SNAPSHOT.jar 

I don't get any such progress bar. The command simply prints the following line repeatedly:

17/10/20 17:52:25 INFO yarn.Client: Application report for application_1507058523816_0443 (state: RUNNING)

The query works fine and the results are fine, but I need feedback on how far along the job is and when it will finish. I have tried the following:

  1. The web page of RUNNING Hadoop Applications does have a progress bar, but it basically never moves. Even in the case of the spark-shell query that progress bar is useless.
  2. I have tried to get the progress bar through the YARN logs, but they are not aggregated until the job is complete. Even then there is no progress bar in the logs.

Is there a way to launch a Spark query packaged in a jar on a cluster and still have a progress bar?

swdev asked Oct 20 '17 17:10


1 Answer

When I create a jar using the exact same query and run it with the following command-line (...) I don't get any such progress bar.

The difference between these two seemingly similar Spark executions is the master URL.

In the former Spark execution with spark-shell yarn, the master is YARN in client deploy mode, i.e. the driver runs on the machine where you start spark-shell from.

In the latter Spark execution with spark-submit --master yarn-cluster, the master is YARN in cluster deploy mode (which is actually equivalent to --master yarn --deploy-mode cluster), i.e. the driver runs on a YARN node.

With that said, you won't get the nice progress bar (which is actually called ConsoleProgressBar) on the local machine; it is shown only on the machine where the driver runs.

A simple solution is to replace yarn-cluster with yarn, so that the driver runs in client deploy mode on the machine you submit from.
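
As a rough sketch, the submission from the question could look like this in client deploy mode. All options are copied from the question; only the master and deploy mode differ. The guard around the final call is an addition for this sketch, so the snippet just prints the command on machines where spark-submit is not installed.

```shell
# Same options as in the question; only the master / deploy mode change.
# --master yarn already defaults to client deploy mode; it is spelled out
# here for clarity.
set -- \
  --master yarn \
  --deploy-mode client \
  --driver-memory 16G \
  --queue default \
  --num-executors 5 \
  --executor-cores 4 \
  --executor-memory 32G \
  --name MyQuery \
  --class com.data.MyQuery \
  target/uber-my-query-0.1-SNAPSHOT.jar

# Show the full command for review.
submit_cmd="spark-submit $*"
echo "$submit_cmd"

# Submit only where spark-submit is on the PATH.
if command -v spark-submit >/dev/null 2>&1; then
  spark-submit "$@"
fi
```

Keep in mind that in client mode the driver, and with it the 16G of driver memory, lives on the submitting machine, and the job is tied to that terminal session.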


ConsoleProgressBar shows the progress of active stages to standard error, i.e. stderr.

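The progress includes the stage id and the number of completed, active, and total tasks. In the bar from the question, (14174 + 5) / 62500 means 14174 completed tasks, 5 active tasks, and 62500 tasks in total.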
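
To make the fields concrete, here is a small illustrative shell function that assembles a line of the same shape from those four numbers. It is not Spark's code: the name render_bar and the width argument are made up for this sketch, and the exact spacing of Spark's own output may differ.

```shell
# render_bar STAGE_ID COMPLETED ACTIVE TOTAL [WIDTH]
# Prints one progress line: header, a bar filled in proportion to the
# completed tasks, and a "(completed + active) / total]" tail.
render_bar() {
  stage=$1
  completed=$2
  active=$3
  total=$4
  width=${5:-80}

  header="[Stage $stage:"
  tailer="($completed + $active) / $total]"

  # Space left for the bar once header and tail are accounted for.
  w=$(( width - ${#header} - ${#tailer} ))
  bar=""
  if [ "$w" -gt 0 ]; then
    filled=$(( w * completed / total ))
    i=0
    while [ "$i" -lt "$w" ]; do
      if [ "$i" -lt "$filled" ]; then
        bar="${bar}="
      elif [ "$i" -eq "$filled" ]; then
        bar="${bar}>"
      else
        bar="${bar} "
      fi
      i=$(( i + 1 ))
    done
  fi

  printf '%s%s%s\n' "$header" "$bar" "$tailer"
}

# The numbers from the question, rendered at 60 columns.
render_bar 7 14174 5 62500 60
```

Only completed tasks move the arrow; active tasks appear in the tail but do not count as progress until they finish.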

ConsoleProgressBar is created when the spark.ui.showConsoleProgress Spark property is turned on and the logging level of the org.apache.spark.SparkContext logger is WARN or higher (i.e. fewer messages are printed out, which leaves room for ConsoleProgressBar).

You can find more information in the ConsoleProgressBar section of Mastering Apache Spark 2.
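
One possible way to satisfy both conditions from the command line is sketched below. It assumes a Spark 2.x setup that still uses log4j 1.x property files; the file location and the conf_file variable are hypothetical, and other resource options from the question are left out for brevity. As before, the snippet only prints the command unless spark-submit is available.

```shell
# Hypothetical location for a driver-side logging config (log4j 1.x syntax,
# as used by Spark 2.x). WARN on the root logger also covers the
# org.apache.spark.SparkContext logger.
conf_file="${TMPDIR:-/tmp}/driver-log4j.properties"
cat > "$conf_file" <<'EOF'
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
EOF

# Client deploy mode, progress bar switched on explicitly, and the driver
# JVM pointed at the logging config written above.
set -- \
  --master yarn \
  --deploy-mode client \
  --conf spark.ui.showConsoleProgress=true \
  --driver-java-options "-Dlog4j.configuration=file:$conf_file" \
  --name MyQuery \
  --class com.data.MyQuery \
  target/uber-my-query-0.1-SNAPSHOT.jar

submit_cmd="spark-submit $*"
echo "$submit_cmd"

# Submit only where spark-submit is on the PATH.
if command -v spark-submit >/dev/null 2>&1; then
  spark-submit "$@"
fi
```

Newer Spark releases use a different logging backend, so the property file format and the system property name would need to be adapted there.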

Jacek Laskowski answered Oct 19 '22 09:10