When I run a job on Apache Spark, the web UI gives a view similar to this:
While this is incredibly useful for me as a developer to see where things are, I think the line numbers in the stage description would not be quite as useful for my support team. To make their job easier, I would like to have the ability to provide a bespoke name for each stage of my job, as well as for the job itself, like so:
Is this something that can be done in Spark? If so, how would I do so?
A job comprises several stages. When Spark encounters a function that requires a shuffle, it creates a new stage. Transformations like reduceByKey(), join(), etc. trigger a shuffle and result in a new stage. Spark will also create a stage when you are reading a dataset.
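For example, a simple word count ends up as one job with two stages; the shuffle introduced by reduceByKey marks the stage boundary. A minimal sketch for spark-shell, where sc is already defined:

    // map is a narrow transformation and stays in the first stage;
    // reduceByKey needs a shuffle, so Spark starts a second stage for it.
    val counts = sc.parallelize(Seq("a", "b", "a", "c"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.collect()   // the action submits one job with two stages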
Apache Spark provides a suite of web UI tabs (Jobs, Stages, Tasks, Storage, Environment, Executors, and SQL) to monitor the status of your Spark/PySpark application, the resource consumption of the Spark cluster, and the Spark configuration.
The web UI is intrinsically tied to the SparkContext, so if you do not call .stop() and keep your application alive, then the UI should remain alive. If you need to view the logs, those should still be persisted to the server, though.
That's where one of the lesser-known features of Spark Core, local properties, comes in very handy.
Spark SQL uses it to group different Spark jobs under a single structured query, so you can use the SQL tab and navigate easily.
You can control local properties using SparkContext.setLocalProperty:
Set a local property that affects jobs submitted from this thread, such as the Spark fair scheduler pool. User-defined properties may also be set here. These properties are propagated through to worker tasks and can be accessed there via org.apache.spark.TaskContext#getLocalProperty.
The web UI uses two local properties:
- callSite.short in the Jobs tab (and is exactly what you want)
- callSite.long in the Job Details page

    scala> sc.setLocalProperty("callSite.short", "callSite.short")
    scala> sc.setLocalProperty("callSite.long", "this is callSite.long")
    scala> sc.parallelize(0 to 9).count
    res2: Long = 10
And the result in the web UI:
Click a job to see the details, where you can find the longer call site, i.e. callSite.long.
Here comes the Stages tab.
You can use the following APIs to set and unset the stage names:
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/SparkContext.html#setCallSite-java.lang.String-
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/SparkContext.html#clearCallSite--
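For example, a minimal spark-shell sketch (the call-site text is just an illustrative placeholder):

    // Give the next job(s) a readable call site, then reset it.
    sc.setCallSite("Load and count customer records")  // shown as the short call site in the UI
    sc.parallelize(0 to 9).count                        // this job picks up the custom call site
    sc.clearCallSite()                                  // later jobs revert to the default call site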
Also, Spark supports the concept of job groups within the application; the following APIs can be used to set and unset the job group names:
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/SparkContext.html#setJobGroup-java.lang.String-java.lang.String-boolean-
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/SparkContext.html#clearJobGroup--
The job description within the job group can also be configured using the following API:
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/SparkContext.html#setJobDescription-java.lang.String-
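Putting the two together, a minimal spark-shell sketch (the group id and description strings are illustrative):

    // Group related jobs and give them a human-readable description.
    // setJobGroup(groupId, description, interruptOnCancel)
    sc.setJobGroup("nightly-etl", "Nightly ETL: aggregate raw events", true)
    sc.setJobDescription("Aggregate events by user")  // description shown for the next job(s)
    sc.parallelize(0 to 9).count                       // runs under the "nightly-etl" job group
    sc.clearJobGroup()                                 // later jobs are no longer part of the group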