
How to programmatically get information about executors in PySpark

When a new PySpark application is started, it creates a nice web UI with tabs for Jobs, Stages, Executors, etc. If I go to the Executors tab I can see the full list of executors and some information about each one, such as the number of cores, storage memory used vs. total, etc.

My question is whether I can somehow access the same information (or at least part of it) programmatically from the application itself, e.g. with something like spark.sparkContext.<function_name_to_get_info_about_executors>()?

I've found a workaround that makes a URL request similar to what the web UI does, but I suspect I'm missing a simpler solution.

I'm using Spark 3.0.0.

asked Jun 23 '20 by Alexander Pivovarov

2 Answers

The only way I've found so far seems hacky to me: it involves scraping the same URL that the web UI queries, i.e. doing this:

import urllib.request
import json

sc = spark.sparkContext

# The Executors tab is backed by a REST endpoint; build its URL for this application
u = sc.uiWebUrl + '/api/v1/applications/' + sc.applicationId + '/allexecutors'

# Fetch and parse the JSON list of executors (active and dead, including the driver)
with urllib.request.urlopen(u) as url:
    executors_data = json.loads(url.read().decode())
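
For illustration, a minimal sketch of reading a few fields out of that response, mirroring what the Executors tab shows (cores, storage memory used vs. total). The field names (id, hostPort, totalCores, memoryUsed, maxMemory) are taken from the ExecutorSummary schema of Spark's monitoring REST API, so verify them against your Spark version:

# Hypothetical usage: summarize each executor from the parsed JSON.
# Field names follow Spark's ExecutorSummary REST schema (check your version).
for e in executors_data:
    print("id={id} host={hostPort} cores={totalCores} "
          "storage={memoryUsed}/{maxMemory} bytes".format(**e))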
answered Nov 29 '22 by Alexander Pivovarov

Another option is to implement a SparkListener that overrides some or all of the onExecutor...() methods, depending on your needs, and then register it at spark-submit time with --conf spark.extraListeners=<your listener class>.
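
For illustration, a minimal sketch of such a listener, written in Scala since spark.extraListeners expects a JVM class on the driver's classpath. The class name ExecutorTrackingListener is made up; the listener and event types come from org.apache.spark.scheduler:

import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded, SparkListenerExecutorRemoved}

// Hypothetical listener that logs executor lifecycle events
class ExecutorTrackingListener extends SparkListener {

  // Called when a new executor registers with the driver
  override def onExecutorAdded(event: SparkListenerExecutorAdded): Unit = {
    println(s"Executor added: id=${event.executorId}, " +
      s"host=${event.executorInfo.executorHost}, cores=${event.executorInfo.totalCores}")
  }

  // Called when an executor is removed
  override def onExecutorRemoved(event: SparkListenerExecutorRemoved): Unit = {
    println(s"Executor removed: id=${event.executorId}, reason=${event.reason}")
  }
}

Package it into a jar, ship the jar with --jars, and pass the fully qualified class name to spark.extraListeners.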

Your own solution is totally legit too; it just uses Spark's REST API.

Both are going to be quite involved, so pick your poison -- parse long JSONs or walk a hierarchy of Developer API objects.

answered Nov 29 '22 by mazaneicha