 

How to get execution DAG from spark web UI after job has finished running, when I am running spark on YARN?

I frequently analyze the DAG of my Spark job while it is running, but it is annoying to have to sit and watch the application just to see the DAG.

So I tried to view the DAG using the Spark history server, which I understand should let me see past jobs. I can easily access port 18080 and see the history server UI.

But it doesn't show me any information related to the Spark program's execution. I know the history server is running, because when I run sudo service --status-all I see

spark history-server is running [ OK ]

So I already tried what this question suggested: here.

I think this is because I'm running Spark on YARN, and it can only use one resource manager at a time, maybe?

So, how do I see the Spark execution DAG *after* a job has finished, specifically when running YARN as my resource manager?

asked May 17 '17 by makansij


People also ask

Where is the DAG in Spark UI?

If we click the 'show at <console>: 24' link of the last query, we will see the DAG and details of the query execution. The query details page displays information about the query execution time, its duration, the list of associated jobs, and the query execution DAG.

How do you know if YARN is running on Spark?

Check the master your application was started with: if it says yarn, it's running on YARN; if it shows a URL of the form spark://..., it's a standalone cluster. You can also run yarn application -list and check whether your application appears in the output, which means it's running in YARN mode.
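For example, a quick check from the command line (a minimal sketch, assuming the YARN CLI is on the PATH):

# List running YARN applications and look for your Spark job's name/ID
yarn application -list -appStates RUNNING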

How do I check the status of my Spark job?

Viewing Spark Job Progress When you run a code in the Jupyter notebook, you can see the progress of the Spark job at each cell. You can view the details of that Spark job by clicking on the View Details hyperlink.


2 Answers

As mentioned in Monitoring and Instrumentation, the following three parameters need to be set in spark-defaults.conf:

spark.eventLog.enabled
spark.eventLog.dir
spark.history.fs.logDirectory

The first property should be set to true:

spark.eventLog.enabled           true

The second and third properties should point to the event-log location, which can be either on the local file system or on HDFS. The second property defines where Spark jobs write their event logs, and the third property tells the history server where to read the logs it displays in the web UI on port 18080.

If you choose the Linux local file system (/opt/spark/spark-events):
Either

spark.eventLog.dir               file:/opt/spark/spark-events
spark.history.fs.logDirectory    file:/opt/spark/spark-events

Or

spark.eventLog.dir               file:///opt/spark/spark-events
spark.history.fs.logDirectory    file:///opt/spark/spark-events

should work
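Note that the event-log directory has to exist before jobs start writing to it; for the local-file-system example above, something like this should be enough (path taken from the example, adjust to your setup):

# Create the local event-log directory
mkdir -p /opt/spark/spark-events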

If you choose HDFS (/spark-events):
Either

spark.eventLog.dir               hdfs:/spark-events
spark.history.fs.logDirectory    hdfs:/spark-events

Or

spark.eventLog.dir               hdfs:///spark-events
spark.history.fs.logDirectory    hdfs:///spark-events

Or

spark.eventLog.dir               hdfs://masterIp:9090/spark-events
spark.history.fs.logDirectory    hdfs://masterIp:9090/spark-events

should work, where masterIp:9090 matches the fs.default.name property (fs.defaultFS in newer Hadoop versions) in core-site.xml of the Hadoop configuration.
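If you are unsure of that value, you can query it from the Hadoop configuration and create the HDFS event-log directory in one go (a minimal sketch, assuming the hdfs CLI is available):

# Print the configured default file system, e.g. hdfs://masterIp:9090
hdfs getconf -confKey fs.defaultFS

# Create the HDFS event-log directory used in the example above
hdfs dfs -mkdir -p /spark-events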

The Apache Spark history server can be started with

$SPARK_HOME/sbin/start-history-server.sh

A third-party Spark history server, for example Cloudera's, can be started with

sudo service spark-history-server start

And to stop the history server (for Apache)

$SPARK_HOME/sbin/stop-history-server.sh

Or (for Cloudera)

sudo service spark-history-server stop
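If you prefer not to edit spark-defaults.conf, the same event-log settings can also be passed per job on the spark-submit command line (a sketch using the local-file-system path from above and Spark's bundled SparkPi example; adjust the paths for your installation):

spark-submit \
  --master yarn \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=file:///opt/spark/spark-events \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_*.jar 100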
answered Oct 03 '22 by Ramesh Maharjan


Running only the history server is not sufficient to get the execution DAG of previous jobs. You also need to configure your jobs to store their event logs.

Run the Spark history server with ./sbin/start-history-server.sh

Enable event logging for the Spark job:

spark.eventLog.enabled true
spark.eventLog.dir <path to event log(local or hdfs)>
spark.history.fs.logDirectory  <path to event log(local or hdfs)>

Add these to the spark-defaults.conf file.
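To verify the setup, run a job and check that event-log files appear in the configured directory before opening the history server UI on port 18080 (placeholders here match the configuration above):

# Local file system
ls <path to event log>

# HDFS
hdfs dfs -ls <path to event log>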

answered Oct 03 '22 by koiralo