
Configure Zeppelin's Spark Interpreter on EMR when starting a cluster

I am creating clusters on EMR and configuring Zeppelin to read its notebooks from S3. To do that, I am using a JSON object that looks like this:

[
  {
    "Classification": "zeppelin-env",
    "Properties": {

    },
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "ZEPPELIN_NOTEBOOK_STORAGE":"org.apache.zeppelin.notebook.repo.S3NotebookRepo",
          "ZEPPELIN_NOTEBOOK_S3_BUCKET":"hs-zeppelin-notebooks",
          "ZEPPELIN_NOTEBOOK_USER":"user"
        },
        "Configurations": [

        ]
      }
    ]
  }
]

I paste this object into the Software configuration page of EMR. My question is: how and where can I configure the Spark interpreter directly, without having to configure it manually from Zeppelin each time I start a cluster?

Rami asked Jul 26 '17 13:07




1 Answer

This is a bit involved. You need to do two things:

  1. Edit Zeppelin's interpreter.json
  2. Restart the interpreter

To do both, write a shell script and add an extra step to the EMR cluster configuration that runs it.

Zeppelin's configuration is stored as JSON, so you can use jq (a command-line JSON processor) to manipulate it. I don't know exactly what you want to change, but here is an example that adds the (mysteriously missing) DepInterpreter:

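#!/bin/bash
set -euo pipefail

CONF=/etc/zeppelin/conf/interpreter.json
TMP=$(mktemp)

# 1. Edit the Spark interpreter setting: append the DepInterpreter to its interpreterGroup.
# Write to a temporary file first; piping jq straight back into the file it is reading can truncate it.
jq '.interpreterSettings."2ANGGHHMQ".interpreterGroup += [{"class":"org.apache.zeppelin.spark.DepInterpreter", "name":"dep"}]' "$CONF" > "$TMP"
sudo -u zeppelin tee "$CONF" < "$TMP" > /dev/null
rm -f "$TMP"

# 2. Trigger a restart of the Spark interpreter through Zeppelin's REST API
curl -X PUT http://localhost:8890/api/interpreter/setting/restart/2ANGGHHMQ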
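
The script above hardcodes the interpreter setting ID 2ANGGHHMQ and only touches interpreterGroup. Since the question is about Spark settings, here is a rough sketch of two helpers that look up the ID by name and set a property instead. The function names are hypothetical, and the sketch assumes that each entry under interpreterSettings has a "name" field and a flat "properties" object of string values; check your own interpreter.json, because the layout may differ between Zeppelin versions. spark.executor.memory and 4g are only illustrative values.

```shell
#!/bin/bash
# Hypothetical helpers; these names are not part of Zeppelin or EMR.

# Print the ID (the key under .interpreterSettings) of the setting whose
# "name" field equals $2, reading the interpreter.json given as $1.
find_interpreter_id() {
  local conf_file="$1" name="$2"
  jq -r --arg name "$name" \
    '.interpreterSettings | to_entries[] | select(.value.name == $name) | .key' \
    "$conf_file"
}

# Print a copy of the interpreter.json given as $1 in which property $3 of
# interpreter setting $2 is set to the string $4. Assumes the flat
# "properties": {"key": "value"} layout described in the text.
set_interpreter_property() {
  local conf_file="$1" id="$2" key="$3" value="$4"
  jq --arg id "$id" --arg key "$key" --arg value "$value" \
    '.interpreterSettings[$id].properties[$key] = $value' \
    "$conf_file"
}

# Example use on the master node (path and port taken from the script above):
#   id=$(find_interpreter_id /etc/zeppelin/conf/interpreter.json spark)
#   set_interpreter_property /etc/zeppelin/conf/interpreter.json "$id" \
#     spark.executor.memory 4g > /tmp/interpreter.json
#   sudo -u zeppelin tee /etc/zeppelin/conf/interpreter.json < /tmp/interpreter.json > /dev/null
#   curl -X PUT "http://localhost:8890/api/interpreter/setting/restart/$id"
```

Both helpers print to standard output rather than editing in place, so the caller decides when the real file gets overwritten.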

Put this shell script in an S3 bucket, then start your EMR cluster with the following step option:

--steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[s3://mybucket/script.sh]
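
The script-runner.jar path in that option contains a region name (eu-west-1), so it likely needs to match the region your cluster runs in. As a sketch, a small hypothetical helper can build the option value from a region and a script location; the function name is made up, and the release label, instance settings and configuration file name in the usage comment are placeholders, not values from the question.

```shell
#!/bin/bash
# Hypothetical helper: print a --steps value that runs an S3-hosted script
# through script-runner.jar in the given region ($1 = region, $2 = script URI).
script_runner_step() {
  local region="$1" script_uri="$2"
  printf 'Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://%s.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[%s]\n' \
    "$region" "$script_uri"
}

# Example use (all values other than the step itself are placeholders):
#   aws emr create-cluster \
#     --release-label emr-5.7.0 \
#     --applications Name=Spark Name=Zeppelin \
#     --instance-type m4.large --instance-count 3 \
#     --use-default-roles \
#     --configurations file://zeppelin-config.json \
#     --steps "$(script_runner_step eu-west-1 s3://mybucket/script.sh)"
```

Here zeppelin-config.json would hold the zeppelin-env JSON from the question, so the notebook storage settings and the interpreter script are applied in the same cluster launch.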
rdeboo answered Sep 24 '22 06:09
