I am creating clusters on EMR and configuring Zeppelin to read its notebooks from S3. To do that, I am using a JSON object that looks like this:
[
  {
    "Classification": "zeppelin-env",
    "Properties": {},
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "ZEPPELIN_NOTEBOOK_STORAGE": "org.apache.zeppelin.notebook.repo.S3NotebookRepo",
          "ZEPPELIN_NOTEBOOK_S3_BUCKET": "hs-zeppelin-notebooks",
          "ZEPPELIN_NOTEBOOK_USER": "user"
        },
        "Configurations": []
      }
    ]
  }
]
I am pasting this object into the Software configuration page of EMR. My question is: how/where can I configure the Spark interpreter directly, without having to configure it manually from Zeppelin each time I start a cluster?
To supply a configuration, navigate to the Create cluster page and choose Edit software settings. You can then enter the configuration directly by using either JSON or a shorthand syntax demonstrated in shadow text in the console. Otherwise, you can provide an Amazon S3 URI for a file with a JSON Configurations object.
To submit a Spark application as a step using the console: open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/ . Select the name of your cluster from the Cluster List. The cluster state must be Waiting. Choose Steps, and then choose Add step.
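This is a bit involved. You will need to do 2 things: edit Zeppelin's interpreter.json, and restart the Spark interpreter so that the change is picked up.
To do both automatically, write a shell script and then add an extra step to the EMR cluster configuration that runs this shell script.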
The Zeppelin configuration is stored as JSON, so you can use jq (a command-line JSON processor) to manipulate it. I don't know exactly what you want to change, but here is an example that adds the (mysteriously missing) DepInterpreter:
#!/bin/bash
set -e
CONF=/etc/zeppelin/conf/interpreter.json
# 1. Edit the Spark interpreter (2ANGGHHMQ is its setting ID in this interpreter.json)
# Write to a temp file first: piping jq into tee on the same file can truncate it before it is read
TMP=$(mktemp)
jq '.interpreterSettings."2ANGGHHMQ".interpreterGroup |= . + [{"class":"org.apache.zeppelin.spark.DepInterpreter", "name":"dep"}]' "$CONF" > "$TMP"
sudo -u zeppelin tee "$CONF" < "$TMP" > /dev/null
# 2. Trigger a restart of the Spark interpreter
curl -X PUT http://localhost:8890/api/interpreter/setting/restart/2ANGGHHMQ
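The setting ID 2ANGGHHMQ is hard-coded in the script above, and it may well be different on your cluster or Zeppelin version. As a rough sketch, the ID could be looked up by name instead. This assumes interpreter.json keeps its settings in an object keyed by ID, each entry carrying a "name" field; the function name setting_id is hypothetical, not something Zeppelin or EMR provides.

```shell
#!/bin/bash
# Hypothetical helper: print the ID of the interpreter setting with the given name.
# Assumes .interpreterSettings is an object keyed by setting ID and that
# every entry has a "name" field (e.g. "spark").
setting_id() {
  local conf_file="$1" name="$2"
  jq -r --arg name "$name" \
    '.interpreterSettings | to_entries[] | select(.value.name == $name) | .key' \
    "$conf_file"
}
```

You could then call SPARK_ID=$(setting_id /etc/zeppelin/conf/interpreter.json spark) and use "$SPARK_ID" in the restart URL in place of the literal ID. If no setting matches, the function prints nothing, so it is worth checking for an empty result before continuing.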
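Upload the script that edits interpreter.json to an S3 bucket. Then start your EMR cluster with the following --steps argument (for example, passed to aws emr create-cluster):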
--steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[s3://mybucket/script.sh]
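Note that the script-runner jar path contains the region (eu-west-1 above), so it needs adjusting if your cluster runs elsewhere. A minimal sketch of building the step value for a given region and script location follows; the function name script_runner_step is made up for illustration, and the bucket and script names are the placeholders from the example above.

```shell
#!/bin/bash
# Hypothetical helper: build the value passed to --steps so that
# script-runner.jar executes a shell script stored in S3.
# $1 = AWS region of the cluster, $2 = S3 URI of the shell script.
script_runner_step() {
  local region="$1" script_uri="$2"
  printf 'Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://%s.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[%s]\n' \
    "$region" "$script_uri"
}

# Prints the same step definition as in the example above.
script_runner_step eu-west-1 s3://mybucket/script.sh
```

The output can be passed along as --steps "$(script_runner_step eu-west-1 s3://mybucket/script.sh)" together with the rest of your cluster options.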