I need to set a custom environment variable in EMR to be available when running a spark application.
I have tried adding this:
...
--configurations '[
{
"Classification": "spark-env",
"Configurations": [
{
"Classification": "export",
"Configurations": [],
"Properties": { "SOME-ENV-VAR": "qa1" }
}
],
"Properties": {}
}
]'
...
and also tried replacing "spark-env" with "hadoop-env",
but nothing seems to work.
There is this answer from the AWS forums, but I can't figure out how to apply it.
I'm running EMR 5.3.1 and launch it with a preconfigured step from the CLI: aws emr create-cluster...
Use the yarn-env classification to pass environment variables to the worker nodes. Use the spark-env classification to pass environment variables to the driver when running in deploy mode client. When running in deploy mode cluster, use yarn-env.
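For illustration, a yarn-env version of the configuration from the question might look like the fragment below (VARIABLE_NAME and VARIABLE_VALUE are placeholders, not real names from the question):

```json
[
  {
    "Classification": "yarn-env",
    "Properties": {},
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "VARIABLE_NAME": "VARIABLE_VALUE"
        }
      }
    ]
  }
]
```

Note that environment variable names containing hyphens are generally rejected by shells, so underscores are the safer choice.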
You can use AWS Step Functions to run PySpark applications as EMR Steps on an existing EMR cluster. Using Step Functions, we can also create the cluster, run multiple EMR Steps sequentially or in parallel, and finally, auto-terminate the cluster.
Add the custom configuration, as in the JSON below, to a file, say custom_config.json:
[
{
"Classification": "spark-env",
"Properties": {},
"Configurations": [
{
"Classification": "export",
"Properties": {
"VARIABLE_NAME": "VARIABLE_VALUE"
}
}
]
}
]
Then, when creating the EMR cluster, pass the file reference to the --configurations option:
aws emr create-cluster --configurations file://custom_config.json --other-options...
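Once the cluster is up, a quick sanity check inside the PySpark application is to read the variable from the process environment. This is a minimal sketch; VARIABLE_NAME/VARIABLE_VALUE are the placeholder names from custom_config.json above, and the setdefault line only simulates the EMR export so the snippet also runs locally:

```python
import os

# Simulate the value exported by the EMR "export" classification so this
# snippet runs outside the cluster too; on EMR the variable is already set.
os.environ.setdefault("VARIABLE_NAME", "VARIABLE_VALUE")

# In the driver (deploy mode client) this reads the spark-env export;
# on worker nodes / in deploy mode cluster it reads the yarn-env export.
value = os.environ.get("VARIABLE_NAME")
print(value)
```

If the variable comes back as None on the cluster, the classification (spark-env vs. yarn-env) most likely does not match where the code is running.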
For me, replacing spark-env with yarn-env fixed the issue.