A previous question recommends sc.applicationId, but it is not present in PySpark, only in Scala.

So, how do I figure out the application ID (for YARN) of my PySpark process?
In Spark we can get the Spark application ID inside a task programmatically using SparkEnv.get.blockManager.applicationId. It is a unique identifier for the Spark application, and its format depends on the scheduler implementation: for a local Spark app it looks like 'local-1433865536131', and on YARN it looks like 'application_1433865536131_34483'.
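To make the format remark concrete, here is a minimal PySpark sketch (the names spark and app_id are mine, and the id is fetched with the internal Py4J call from the answer further down) that branches on the prefix of the id:

app_id = spark.sparkContext._jsc.sc().applicationId()  # spark is an active SparkSession

# The prefix of the id reveals which scheduler assigned it.
if app_id.startswith("local-"):
    print("Running in local mode:", app_id)    # e.g. local-1433865536131
elif app_id.startswith("application_"):
    print("Running on YARN:", app_id)          # e.g. application_1433865536131_34483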
In Spark/PySpark you can get the currently active SparkContext and its configuration settings by calling spark.sparkContext.getConf().getAll(), where spark is a SparkSession object. In Scala, getAll returns an Array[(String, String)]; in PySpark, getAll() returns a list of (key, value) tuples, as the sketch below shows.
A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster. Only one SparkContext should be active per JVM. You must stop() the active SparkContext before creating a new one.
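As a hedged sketch of that getConf().getAll() route in PySpark, assuming an active SparkSession named spark and that the running scheduler has populated the spark.app.id entry:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("app-id-demo").getOrCreate()

# In PySpark, getAll() returns a list of (key, value) tuples
# (the Scala counterpart returns Array[(String, String)]).
conf_pairs = spark.sparkContext.getConf().getAll()

# The application id is normally exposed under the spark.app.id key.
app_id = dict(conf_pairs).get("spark.app.id")
print(app_id)  # 'local-...' locally, 'application_...' on YARN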
You could use the Java SparkContext object through the Py4J RPC gateway:
>>> sc._jsc.sc().applicationId()
u'application_1433865536131_34483'
Please note that sc._jsc is an internal variable and not part of the public API, so there is a (rather small) chance that it may change in the future.
I'll submit a pull request to add a public API call for this.
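For completeness, a self-contained sketch of that Py4J approach in a standalone PySpark script; the hasattr guard is only defensive, precisely because _jsc is internal and could disappear in a later release:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("get-app-id").getOrCreate()
sc = spark.sparkContext

# Reach the underlying Scala SparkContext through the Py4J gateway.
if hasattr(sc, "_jsc"):
    app_id = sc._jsc.sc().applicationId()
else:
    # Fallback: read the id from the context's configuration.
    app_id = sc.getConf().get("spark.app.id")

print(app_id)  # on YARN: e.g. application_1433865536131_34483
spark.stop()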