Configure Zeppelin's Spark Interpreter on EMR when starting a cluster

I am creating clusters on EMR and configuring Zeppelin to read its notebooks from S3. To do that I am using a JSON object that looks like this:

[
  {
    "Classification": "zeppelin-env",
    "Properties": {

    },
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "ZEPPELIN_NOTEBOOK_STORAGE":"org.apache.zeppelin.notebook.repo.S3NotebookRepo",
          "ZEPPELIN_NOTEBOOK_S3_BUCKET":"hs-zeppelin-notebooks",
          "ZEPPELIN_NOTEBOOK_USER":"user"
        },
        "Configurations": [

        ]
      }
    ]
  }
]

I paste this object into the Software configuration page of EMR. My question is: how and where can I configure the Spark interpreter directly, so that I don't have to configure it manually from Zeppelin each time I start a cluster?

asked Jul 26 '17 13:07

Rami

1 Answer

This is a bit involved. You will need to do two things:

Edit the interpreter.json of Zeppelin
Restart the interpreter

So what you need to do is write a shell script and then add an extra step to the EMR cluster configuration that runs this shell script.

The Zeppelin configuration is stored as JSON, so you can use jq (a command-line JSON processor) to manipulate it. I don't know exactly what you want to change, but here is an example that adds the (mysteriously missing) DepInterpreter:

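#!/bin/bash
set -e

CONF=/etc/zeppelin/conf/interpreter.json

# 1. Edit the Spark interpreter setting (2ANGGHHMQ is its ID in this example; verify it in your interpreter.json).
# Write to a temporary file first: reading and overwriting the same file in one pipeline can truncate it.
TMP=$(mktemp)
jq '.interpreterSettings."2ANGGHHMQ".interpreterGroup |= . + [{"class":"org.apache.zeppelin.spark.DepInterpreter", "name":"dep"}]' "$CONF" > "$TMP"
sudo -u zeppelin tee "$CONF" < "$TMP" > /dev/null
rm -f "$TMP"

# 2. Trigger a restart of the Spark interpreter
curl -X PUT http://localhost:8890/api/interpreter/setting/restart/2ANGGHHMQ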
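
If you would rather not hard-code the setting ID, a lookup along these lines might work. This is only a sketch: it assumes each entry under interpreterSettings carries "name" and "id" fields, and the sample file (including the second ID) is made up for illustration so the snippet can run anywhere. The function name spark_setting_id is my own.

```shell
#!/bin/bash

# Print the ID of the interpreter setting whose name is "spark".
# $1: path to an interpreter.json file
# Assumption: every entry under .interpreterSettings has "name" and "id" fields.
spark_setting_id() {
  jq -r '.interpreterSettings[] | select(.name == "spark") | .id' "$1"
}

# Demo on a tiny stand-in file with made-up content.
# On a cluster you would pass /etc/zeppelin/conf/interpreter.json instead.
SAMPLE="$(mktemp)"
cat > "$SAMPLE" <<'EOF'
{
  "interpreterSettings": {
    "2ANGGHHMQ": { "id": "2ANGGHHMQ", "name": "spark", "interpreterGroup": [] },
    "2AM1YV5CU": { "id": "2AM1YV5CU", "name": "md", "interpreterGroup": [] }
  }
}
EOF

SPARK_ID="$(spark_setting_id "$SAMPLE")"
echo "Spark interpreter setting ID: $SPARK_ID"
rm -f "$SAMPLE"
```

The resulting value could then replace the literal 2ANGGHHMQ in both the jq filter and the restart URL.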

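Put this shell script in an S3 bucket. Then start your EMR cluster with an extra step that runs it (adjust the region in the script-runner path to match your cluster's region):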

--steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[s3://mybucket/script.sh]
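
For context, here is roughly how that option could fit into a full aws emr create-cluster call together with the Zeppelin configuration from the question. Treat it as a sketch: the cluster name, release label, instance settings, step name and the file name zeppelin-config.json are placeholders of mine, not values from the question, and the helper build_step is hypothetical. The command is only printed, not executed.

```shell
#!/bin/bash

# Build the --steps value for a script-runner step.
# $1: AWS region of the cluster, $2: S3 URI of the shell script to run
build_step() {
  local region="$1" script_uri="$2"
  echo "Type=CUSTOM_JAR,Name=ConfigureZeppelin,ActionOnFailure=CONTINUE,Jar=s3://${region}.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[${script_uri}]"
}

# Placeholder values; replace them with your own.
REGION="eu-west-1"
STEP="$(build_step "$REGION" "s3://mybucket/script.sh")"

# Print the command instead of running it; remove the leading "echo" to launch a cluster.
# zeppelin-config.json is assumed to contain the JSON object from the question.
echo aws emr create-cluster \
  --name "zeppelin-cluster" \
  --region "$REGION" \
  --release-label emr-5.7.0 \
  --applications Name=Spark Name=Zeppelin \
  --instance-type m4.large --instance-count 3 \
  --use-default-roles \
  --configurations file://zeppelin-config.json \
  --steps "$STEP"
```

Building the step string from the region keeps the script-runner path and the cluster region in sync.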

answered Sep 24 '22 06:09

rdeboo

Tags: apache-spark, emr, amazon-emr, apache-zeppelin