
How to fetch Spark Streaming job statistics using REST calls when running in yarn-cluster mode

I have a Spark Streaming program running on a YARN cluster in "yarn-cluster" mode (--master yarn-cluster). I want to fetch Spark job statistics in JSON format using REST APIs. I am able to fetch basic statistics using the REST URL http://yarn-cluster:8088/proxy/application_1446697245218_0091/metrics/json, but this gives only very basic statistics.

However, I want to fetch per-executor or per-RDD statistics. How do I do that using REST calls, and where can I find the exact REST URLs for these statistics? The $SPARK_HOME/conf/metrics.properties file sheds some light on the URLs, i.e.:

5. MetricsServlet is added by default as a sink in the master, worker, and client driver; you can send an HTTP request to "/metrics/json" to get a snapshot of all the registered metrics in JSON format. For the master, requests to "/metrics/master/json" and "/metrics/applications/json" can be sent separately to get metrics snapshots of the master instance and of the applications. MetricsServlet does not need to be configured by the user.

but those return HTML pages, not JSON; only "/metrics/json" returns stats in JSON format. On top of that, knowing the application_id programmatically is a challenge in itself when running in yarn-cluster mode.
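One way to discover the application_id programmatically is the YARN ResourceManager's own REST API; here is a rough sketch (assuming the Python requests library and the ResourceManager's default web port, 8088):

```python
import requests

# YARN ResourceManager REST API; 8088 is the default ResourceManager web port.
rm = "http://yarn-cluster:8088"
resp = requests.get(rm + "/ws/v1/cluster/apps", params={"states": "RUNNING"}).json()

# "apps" is null in the response when nothing is running, hence the guard.
running = (resp.get("apps") or {}).get("app", [])
for app in running:
    print(app["id"], app["name"], app["trackingUrl"])
```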

I checked the REST API section of the Spark Monitoring page, but that didn't work when we ran the Spark job in yarn-cluster mode. Any pointers/answers are welcome.

asked Dec 29 '15 by ramanKC


People also ask

Do you need to install Spark on all nodes of the YARN cluster when running Spark on YARN?

No, it is not necessary to install Spark on every node. Since Spark runs on top of YARN, it uses YARN to execute its commands across the cluster's nodes, so you only have to install Spark on one node.

How do I run a Spark job in cluster mode?

In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
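For illustration, a sketch of submitting the same (hypothetical) application in each deploy mode, driven from Python:

```python
import subprocess

# Cluster mode: the driver runs inside the YARN application master.
# (On Spark 1.x the equivalent flag was --master yarn-cluster.)
subprocess.check_call([
    "spark-submit", "--master", "yarn", "--deploy-mode", "cluster",
    "my_streaming_app.py",  # hypothetical application file
])

# Client mode: the driver runs locally in this client process.
subprocess.check_call([
    "spark-submit", "--master", "yarn", "--deploy-mode", "client",
    "my_streaming_app.py",
])
```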

Which sources can act as a data source for Spark Streaming?

Spark Streaming has two categories of streaming sources. Basic sources: sources directly available in the StreamingContext API, for example file systems, socket connections, and Akka actors. Advanced sources: sources like Kafka, Flume, Kinesis, and Twitter.
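As an illustration, a minimal basic source in PySpark (a sketch in the Spark 1.x-style StreamingContext API, assuming a socket server is listening on localhost:9999):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="SocketSourceExample")
ssc = StreamingContext(sc, batchDuration=10)  # 10-second batches

# Basic source: read lines of text from a TCP socket.
lines = ssc.socketTextStream("localhost", 9999)
lines.pprint()  # print the first elements of each batch

ssc.start()
ssc.awaitTermination()
```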


3 Answers

You should be able to access the Spark REST API using:

http://yarn-cluster:8088/proxy/application_1446697245218_0091/api/v1/applications/

From here you can select the app-id from the list and then use the following endpoint to get information about executors, for example:

http://yarn-cluster:8088/proxy/application_1446697245218_0091/api/v1/applications/{app-id}/executors
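A minimal sketch of calling these endpoints with Python's requests (host, port, and application ID are placeholders taken from the question; the executor fields printed are examples from the ExecutorSummary response):

```python
import requests

base = "http://yarn-cluster:8088/proxy/application_1446697245218_0091/api/v1"

# List the applications visible through this proxy; each entry carries an "id".
apps = requests.get(base + "/applications").json()
app_id = apps[0]["id"]

# Per-executor statistics for the selected application.
executors = requests.get("%s/applications/%s/executors" % (base, app_id)).json()
for ex in executors:
    print(ex["id"], ex["rddBlocks"], ex["memoryUsed"], ex["totalTasks"])
```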

I verified this with my Spark Streaming application running in yarn-cluster mode.

I'll explain how I arrived at the JSON response using a web browser (this is for a Spark 1.5.2 streaming application in yarn-cluster mode).

First, use the Hadoop URL to view the RUNNING applications: http://{yarn-cluster}:8088/cluster/apps/RUNNING.

Next, select a running application, say http://{yarn-cluster}:8088/cluster/app/application_1450927949656_0021.

Next, click on the TrackingUrl link. This goes through a proxy, and the port is different in my case: http://{yarn-proxy}:20888/proxy/application_1450927949656_0021/. This shows the Spark UI. Now append api/v1/applications to this URL: http://{yarn-proxy}:20888/proxy/application_1450927949656_0021/api/v1/applications.

You should see a JSON response with the application name supplied to SparkConf and the start time of the application.
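Putting the steps together, a minimal sketch with Python's requests (the proxy host and port are placeholders; the trackingUrl can also be read programmatically from the YARN ResourceManager API, as sketched earlier):

```python
import requests

# trackingUrl taken from the YARN UI (placeholder values)
tracking_url = "http://yarn-proxy:20888/proxy/application_1450927949656_0021"

apps = requests.get(tracking_url + "/api/v1/applications").json()
for app in apps:
    attempt = app["attempts"][0]
    print(app["id"], app["name"], attempt["startTime"])
```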

answered Oct 13 '22 by user5728085


I was able to reconstruct the metrics shown in the columns of the Spark Streaming web UI (batch start time, processing delay, scheduling delay) using the /jobs/ endpoint.

The script I used is available here. I wrote a short post describing it and tying its functionality back to the Spark codebase. This approach does not require any web scraping.

It works for Spark 2.0.0 and YARN 2.7.2, but may work for other version combinations too.
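A rough sketch of the idea (not the linked script itself), assuming the /jobs/ endpoint's submissionTime and completionTime fields; recovering the batch start time, and hence the scheduling delay, from each job's name is done as the linked post describes:

```python
import requests
from datetime import datetime

def parse_ts(ts):
    # Spark REST timestamps look like "2016-01-05T09:15:23.123GMT"
    return datetime.strptime(ts.replace("GMT", ""), "%Y-%m-%dT%H:%M:%S.%f")

# Placeholder proxy URL; see the walkthrough in the accepted answer.
base = "http://yarn-proxy:20888/proxy/application_1450927949656_0021/api/v1"
app_id = requests.get(base + "/applications").json()[0]["id"]
jobs = requests.get("%s/applications/%s/jobs" % (base, app_id)).json()

for job in jobs:
    if job["status"] != "SUCCEEDED":
        continue  # completionTime is only present for finished jobs
    # Processing time of this job; the batch time has to be parsed out of
    # job["name"], as the linked post explains.
    duration = parse_ts(job["completionTime"]) - parse_ts(job["submissionTime"])
    print(job["jobId"], duration.total_seconds())
```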

answered Oct 13 '22 by Emaad Ahmed Manzoor


You'll need to scrape the HTML page to get the relevant metrics; there isn't a Spark REST endpoint that exposes this information.
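If scraping is the route taken, a minimal sketch with requests and BeautifulSoup (the URL is a placeholder for the Streaming tab of the application UI; the table layout varies across Spark versions):

```python
import requests
from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed

# Placeholder URL for the Streaming tab behind the YARN proxy.
url = "http://yarn-proxy:20888/proxy/application_1450927949656_0021/streaming/"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Dump every table row; pick out the batch/delay columns from here.
for table in soup.find_all("table"):
    for row in table.find_all("tr"):
        print([cell.get_text(strip=True) for cell in row.find_all(["th", "td"])])
```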

answered Oct 13 '22 by Sachin