I'm using the following Scala code (as a custom spark-submit
wrapper) to submit a Spark application to a YARN cluster:
val result = Seq(spark_submit_script_here).!!
All I have at the time of submission is spark-submit
and the Spark application's jar (no SparkContext). I'd like to capture applicationId
from result
, but it's empty.
I can see in my command line output the applicationId and rest of the Yarn messages:
INFO yarn.Client: Application report for application_1450268755662_0110
How can I read it within code and get the applicationId ?
As stated in the Spark issue 5439, you could either use SparkContext. applicationId or parse the stderr output. Now, as you are wrapping the spark-submit command with your own script/object, I would say you need to read the stderr and get the application id. Save this answer.
Launching Spark on YARN If the configuration references Java system properties or environment variables not managed by YARN, they should also be set in the Spark application's configuration (driver, executors, and the AM when running in client mode).
In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
As stated in the Spark issue 5439, you could either use SparkContext.applicationId
or parse the stderr output. Now, as you are wrapping the spark-submit command with your own script/object, I would say you need to read the stderr and get the application id.
If you are submitting the job via Python, then this is how you can get the yarn application id:
cmd_list = [{
'cmd': '/usr/bin/spark-submit --name %s --master yarn --deploy-mode cluster '
'--executor-memory %s --executor-cores %s --num-executors %s '
'--class %s %s %s'
% (
app_name,
config.SJ_EXECUTOR_MEMORY,
config.SJ_EXECUTOR_CORES,
config.SJ_NUM_OF_EXECUTORS,
config.PRODUCT_SNAPSHOT_SKU_PRESTO_CLASS,
config.SPARK_JAR_LOCATION,
config.SPARK_LOGGING_ENABLED
),
'cwd': config.WORK_DIR
}]
cmd_output = subprocess.run(cmd_obj['cmd'], shell=True, check=True, cwd=cwd, stderr=subprocess.PIPE)
cmd_output = cmd_output.stderr.decode("utf-8")
yarn_application_ids = re.findall(r"application_\d{13}_\d{4}", cmd_output)
if len(yarn_application_ids):
yarn_application_id = yarn_application_ids[0]
yarn_command = "yarn logs -applicationId " + yarn_application_id
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With