 

How to get execution DAG from spark web UI after job has finished running, when I am running spark on YARN?

I frequently analyze the DAG of my Spark job while it is running, but it is annoying to have to sit and watch the application just to see the DAG.

So I tried to view the DAG using the Spark history server, which I understand should let me see past jobs. I can easily access port 18080 and see the history server UI.

But it doesn't show me any information related to the Spark program's execution. I know the history server is running, because when I run sudo service --status-all I see

spark history-server is running [ OK ]

So I already tried what this question suggested: here.

I think this is because I'm running Spark on YARN, and it can only use one resource manager at a time, maybe?

So, how do I see the Spark execution DAG *after* a job has finished, specifically when running YARN as my resource manager?

asked May 17 '17 by makansij


People also ask

Where is the DAG in Spark UI?

If we click the 'show at <console>: 24' link of the last query, we will see the DAG and details of the query execution. The query details page displays information about the query execution time, its duration, the list of associated jobs, and the query execution DAG.

How do you know if YARN is running on Spark?

Check the master your application was started with: if it says yarn, it's running on YARN; if it shows a URL of the form spark://..., it's a standalone cluster. You can also run yarn application -list and check whether your application appears in the output, which means it's running in YARN mode.
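For example, a quick check from the command line (a minimal sketch, assuming the YARN CLI is on the PATH):

# List running YARN applications and look for your Spark job's name/ID
yarn application -list -appStates RUNNING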

How do I check the status of my Spark job?

Viewing Spark Job Progress When you run a code in the Jupyter notebook, you can see the progress of the Spark job at each cell. You can view the details of that Spark job by clicking on the View Details hyperlink.


2 Answers

As mentioned in Monitoring and Instrumentation, the following three parameters need to be set in spark-defaults.conf:

spark.eventLog.enabled
spark.eventLog.dir
spark.history.fs.logDirectory

The first property should be set to true:

spark.eventLog.enabled           true

The second and third properties should point to the event-log location, which can be either on the local file system or on HDFS. The second property defines where Spark jobs write their event logs, and the third property tells the history server where to read the logs it displays in the web UI on port 18080.

If you choose the Linux local file system (/opt/spark/spark-events):
Either

spark.eventLog.dir               file:/opt/spark/spark-events
spark.history.fs.logDirectory    file:/opt/spark/spark-events

Or

spark.eventLog.dir               file:///opt/spark/spark-events
spark.history.fs.logDirectory    file:///opt/spark/spark-events

should work
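Note that the event-log directory has to exist before jobs start writing to it; for the local-file-system example above, something like this should be enough (path taken from the example, adjust to your setup):

# Create the local event-log directory
mkdir -p /opt/spark/spark-events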

If you choose HDFS (/spark-events):
Either

spark.eventLog.dir               hdfs:/spark-events
spark.history.fs.logDirectory    hdfs:/spark-events

Or

spark.eventLog.dir               hdfs:///spark-events
spark.history.fs.logDirectory    hdfs:///spark-events

Or

spark.eventLog.dir               hdfs://masterIp:9090/spark-events
spark.history.fs.logDirectory    hdfs://masterIp:9090/spark-events

should work, where masterIp:9090 matches the fs.default.name property (fs.defaultFS in newer Hadoop versions) in core-site.xml of the Hadoop configuration.
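If you are unsure of that value, you can query it from the Hadoop configuration and create the HDFS event-log directory in one go (a minimal sketch, assuming the hdfs CLI is available):

# Print the configured default file system, e.g. hdfs://masterIp:9090
hdfs getconf -confKey fs.defaultFS

# Create the HDFS event-log directory used in the example above
hdfs dfs -mkdir -p /spark-events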

The Apache Spark history server can be started with

$SPARK_HOME/sbin/start-history-server.sh

A third-party Spark history server, for example Cloudera's, can be started with

sudo service spark-history-server start

And to stop the history server (for Apache)

$SPARK_HOME/sbin/stop-history-server.sh

Or (for Cloudera)

sudo service spark-history-server stop
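If you prefer not to edit spark-defaults.conf, the same event-log settings can also be passed per job on the spark-submit command line (a sketch using the local-file-system path from above and Spark's bundled SparkPi example; adjust the paths for your installation):

spark-submit \
  --master yarn \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=file:///opt/spark/spark-events \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_*.jar 100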
answered Oct 03 '22 by Ramesh Maharjan


Running only the history server is not sufficient to get the execution DAG of previous jobs. You also need to configure your jobs to store their event logs.

Run the Spark history server with ./sbin/start-history-server.sh

Enable event logging for the Spark job:

spark.eventLog.enabled true
spark.eventLog.dir <path to event log(local or hdfs)>
spark.history.fs.logDirectory  <path to event log(local or hdfs)>

Add these to the spark-defaults.conf file.
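To verify the setup, run a job and check that event-log files appear in the configured directory before opening the history server UI on port 18080 (placeholders here match the configuration above):

# Local file system
ls <path to event log>

# HDFS
hdfs dfs -ls <path to event log>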

answered Oct 03 '22 by koiralo