I was going through this Apache Spark documentation, and it mentions that:
When running Spark on YARN in cluster mode, environment variables need to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName] property in your conf/spark-defaults.conf file.
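For example (with MY_ENV_VAR as a placeholder variable name), I understand this means adding a line like the following to conf/spark-defaults.conf:
spark.yarn.appMasterEnv.MY_ENV_VAR    some_value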
I am running my EMR cluster from AWS Data Pipeline. I want to know where I have to edit this conf file. Also, if I create my own custom conf file and specify it as part of --configurations (in the spark-submit), will that solve my use case?
One way to do it is the following (the tricky part is that you might need to set up the environment variables on both the driver and the executor parameters):
spark-submit \
--driver-memory 2g \
--executor-memory 4g \
--conf spark.executor.instances=4 \
--conf spark.driver.extraJavaOptions="-DENV_KEY=ENV_VALUE" \
--conf spark.executor.extraJavaOptions="-DENV_KEY=ENV_VALUE" \
--master yarn \
--deploy-mode cluster \
--class com.industry.class.name \
assembly-jar.jar
I have tested it on EMR in client mode, but it should work in cluster mode as well.
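Note that -DENV_KEY=ENV_VALUE sets a JVM system property rather than an OS environment variable, so inside the application it is read with sys.props / System.getProperty, whereas values set through spark.yarn.appMasterEnv or spark.executorEnv arrive as real environment variables and are read with sys.env. A minimal sketch in Scala, assuming ENV_KEY is the name used in the command above:

object ReadPassedValues {
  def main(args: Array[String]): Unit = {
    // Value passed as a JVM system property via
    // --conf spark.driver.extraJavaOptions="-DENV_KEY=ENV_VALUE"
    val fromSysProp = sys.props.getOrElse("ENV_KEY", "not set")
    println(s"ENV_KEY as system property: $fromSysProp")

    // Value passed as an environment variable via spark.yarn.appMasterEnv.ENV_KEY
    // (driver in cluster mode) or spark.executorEnv.ENV_KEY (executors)
    val fromEnv = sys.env.getOrElse("ENV_KEY", "not set")
    println(s"ENV_KEY as environment variable: $fromEnv")
  }
}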
For future reference, you can also pass the environment variables directly when creating the EMR cluster, using the Configurations parameter as described in the docs here.
Specifically, the spark-defaults file can be modified by passing a configuration JSON as follows:
{
'Classification': 'spark-defaults',
'Properties': {
'spark.yarn.appMasterEnv.[EnvironmentVariableName]': 'some_value',
'spark.executorEnv.[EnvironmentVariableName]': 'some_other_value'
}
},
Here, spark.yarn.appMasterEnv.[EnvironmentVariableName] is used to pass a variable to the driver (Application Master) in YARN cluster mode (here), and spark.executorEnv.[EnvironmentVariableName] is used to pass a variable to the executor processes (here).
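As a sketch of how this can be supplied from the AWS CLI (the cluster name, release label, and instance settings below are placeholders, and MY_ENV_VAR stands in for your variable name), the configuration can be saved to a file such as spark-env.json, wrapped in a list:

[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.yarn.appMasterEnv.MY_ENV_VAR": "some_value",
      "spark.executorEnv.MY_ENV_VAR": "some_other_value"
    }
  }
]

and then referenced when creating the cluster:

aws emr create-cluster \
  --name "my-cluster" \
  --release-label emr-5.30.0 \
  --applications Name=Spark \
  --use-default-roles \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --configurations file://spark-env.json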