How to set up Zeppelin to work with remote EMR Yarn cluster

I have an Amazon EMR cluster (Hadoop 2.6, Spark 1.4.1) that uses YARN as its resource manager. I want to deploy Zeppelin on a separate machine so that I can turn off the EMR cluster when no jobs are running.

I tried following the instructions at https://zeppelin.incubator.apache.org/docs/install/yarn_install.html without much success.

Can somebody demystify the steps needed for Zeppelin to connect to an existing YARN cluster from a different machine?

asked Sep 15 '15 18:09 by snowindy



1 Answer

[1] Install Zeppelin with the proper build parameters:

git clone https://github.com/apache/incubator-zeppelin.git ~/zeppelin;
cd ~/zeppelin;
mvn clean package -Pspark-1.4 -Dhadoop.version=2.6.0 -Phadoop-2.6 -Pyarn -DskipTests

[2] Update the EC2 security group of EMR_MASTER to accept incoming connections from the Zeppelin machine. I opened all ports; this should be limited to the specific ports Zeppelin needs, but I do not yet know which ones those are.
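
Step [2] can also be done from the command line. This is a minimal sketch using the AWS CLI, assuming it is installed and configured with credentials for the account. The security group ID and the Zeppelin machine's address are placeholders, not values from the original setup. It opens all TCP ports to that single address, mirroring the "all ports" rule above. DRY_RUN defaults to 1, so the command is only printed until you set DRY_RUN=0.

```shell
#!/usr/bin/env bash
# Placeholders (assumptions): replace with your own values.
MASTER_SG="${MASTER_SG:-sg-0123456789abcdef0}"   # hypothetical security group of EMR_MASTER
ZEPPELIN_IP="${ZEPPELIN_IP:-203.0.113.10}"       # hypothetical address of the Zeppelin machine

# Print the command when DRY_RUN=1 (default), execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Allow all TCP ports from the Zeppelin machine only (a /32 CIDR is one host).
open_master_to_zeppelin() {
  run aws ec2 authorize-security-group-ingress \
    --group-id "$MASTER_SG" \
    --protocol tcp \
    --port 0-65535 \
    --cidr "${ZEPPELIN_IP}/32"
}

open_master_to_zeppelin
```

Once the required ports are known, the 0-65535 range can be narrowed to them.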

[3] Copy the directory EMR_MASTER:/etc/hadoop/conf to MY_STANDALONE_SERVER:/home/zeppelin/hadoop-conf.
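
One way to do the copy in step [3] is scp, run from the Zeppelin machine. This is a minimal sketch, assuming SSH access to the master as the hadoop user (the usual login on EMR nodes) with a key file. The host name and key path are placeholders. DRY_RUN defaults to 1, so the command is only printed until you set DRY_RUN=0.

```shell
#!/usr/bin/env bash
# Placeholders (assumptions): replace with your own values.
EMR_MASTER="${EMR_MASTER:-emr-master.example.com}"   # hypothetical host name of the master
SSH_KEY="${SSH_KEY:-$HOME/.ssh/emr-key.pem}"          # hypothetical key file
DEST="${DEST:-/home/zeppelin/hadoop-conf}"            # target path from step [3]

# Print the command when DRY_RUN=1 (default), execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

copy_hadoop_conf() {
  # DEST should not exist yet; if it does, scp -r creates DEST/conf instead.
  run scp -i "$SSH_KEY" -r "hadoop@${EMR_MASTER}:/etc/hadoop/conf" "$DEST"
}

copy_hadoop_conf
```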

[4] zeppelin/conf/zeppelin-env.sh should contain:

export MASTER=yarn-client
export HADOOP_CONF_DIR=/home/zeppelin/hadoop-conf

Note: Spark parameters such as spark.executor.instances are taken from the Interpreter settings in Zeppelin, so they should be specified there.
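
Before starting Zeppelin, it can help to confirm that the directory from steps [3] and [4] was copied completely. This is a minimal sketch. The list of files it checks (core-site.xml and yarn-site.xml) is an assumption about the minimum a YARN client needs, not something stated above, and check_hadoop_conf is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Return 0 if the directory contains readable copies of the expected files.
# The file list is an assumption; extend it if your setup needs more.
check_hadoop_conf() {
  local dir="$1" missing=0 f
  for f in core-site.xml yarn-site.xml; do
    if [ ! -r "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Falls back to the path from step [4] when HADOOP_CONF_DIR is not set.
if check_hadoop_conf "${HADOOP_CONF_DIR:-/home/zeppelin/hadoop-conf}"; then
  echo "Hadoop client configuration looks complete"
else
  echo "Hadoop client configuration is incomplete"
fi
```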

answered Oct 16 '22 05:10 by snowindy