 

How to get applicationId of Spark application deployed to YARN in Scala?

I'm using the following Scala code (as a custom spark-submit wrapper) to submit a Spark application to a YARN cluster:

import scala.sys.process._
val result = Seq(spark_submit_script_here).!!

All I have at the time of submission is spark-submit and the Spark application's jar (no SparkContext). I'd like to capture applicationId from result, but it's empty.

I can see the applicationId and the rest of the YARN messages in my command-line output:

INFO yarn.Client: Application report for application_1450268755662_0110

How can I read it within my code and get the applicationId?

nish1013 asked Jan 04 '16



2 Answers

As stated in Spark issue SPARK-5439, you could either use SparkContext.applicationId or parse the stderr output. Since you are wrapping the spark-submit command with your own script/object, you need to read the stderr and extract the application id, as sketched below.
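For example, here is a minimal sketch of that approach in Scala (the spark-submit arguments shown are placeholders, not your actual command). It uses scala.sys.process.ProcessLogger to collect stderr, since .!! only captures stdout, and then pulls out the id with a regex:

    import scala.sys.process._

    // Placeholder command: substitute your real spark-submit invocation.
    val cmd = Seq("spark-submit", "--master", "yarn", "--deploy-mode", "cluster", "app.jar")

    // .!! returns stdout only; the YARN client writes its application
    // reports to stderr, so collect stderr with a ProcessLogger instead.
    val stderr = new StringBuilder
    val exitCode = cmd ! ProcessLogger(_ => (), line => stderr.append(line).append('\n'))

    // YARN ids look like application_<13-digit timestamp>_<sequence number>
    val appId = """application_\d{13}_\d{4,}""".r.findFirstIn(stderr.toString)
    appId match {
      case Some(id) => println(s"Submitted as $id")
      case None     => println(s"No application id found (exit code $exitCode)")
    }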

Markon answered Oct 02 '22


If you are submitting the job via Python, this is how you can get the YARN application id:

    import re
    import subprocess

    cmd_list = [{
        'cmd': '/usr/bin/spark-submit --name %s --master yarn --deploy-mode cluster '
               '--executor-memory %s --executor-cores %s --num-executors %s '
               '--class %s %s %s'
               % (
                   app_name,
                   config.SJ_EXECUTOR_MEMORY,
                   config.SJ_EXECUTOR_CORES,
                   config.SJ_NUM_OF_EXECUTORS,
                   config.PRODUCT_SNAPSHOT_SKU_PRESTO_CLASS,
                   config.SPARK_JAR_LOCATION,
                   config.SPARK_LOGGING_ENABLED
               ),
        'cwd': config.WORK_DIR
    }]

    for cmd_obj in cmd_list:
        # spark-submit logs the YARN application report to stderr, so capture it
        cmd_output = subprocess.run(cmd_obj['cmd'], shell=True, check=True,
                                    cwd=cmd_obj['cwd'], stderr=subprocess.PIPE)
        cmd_output = cmd_output.stderr.decode("utf-8")

        # YARN application ids look like application_<13-digit timestamp>_<4-digit sequence>
        yarn_application_ids = re.findall(r"application_\d{13}_\d{4}", cmd_output)
        if len(yarn_application_ids):
            yarn_application_id = yarn_application_ids[0]
            yarn_command = "yarn logs -applicationId " + yarn_application_id
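If log aggregation is enabled on the cluster, that yarn_command can then be executed (for example, with another subprocess.run call) to fetch the application's aggregated logs once the job finishes.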
Rajiv answered Oct 02 '22