I have an Amazon EMR cluster running Hadoop 2.6 and Spark 1.4.1 with the YARN resource manager. I want to deploy Zeppelin on a separate machine so that I can turn off the EMR cluster when no jobs are running. I tried following the instructions at https://zeppelin.incubator.apache.org/docs/install/yarn_install.html without much success. Can somebody demystify the steps for connecting Zeppelin to an existing YARN cluster from a different machine?

How to set up Zeppelin to work with remote EMR Yarn cluster

1 Answer

[1] Install Zeppelin with the proper build parameters:

git clone https://github.com/apache/incubator-zeppelin.git ~/zeppelin;
cd ~/zeppelin;
mvn clean package -Pspark-1.4 -Dhadoop.version=2.6.0 -Phadoop-2.6 -Pyarn -DskipTests

[2] Update the EMR_MASTER EC2 security groups to accept incoming requests on all ports so that the cluster can communicate with Zeppelin. (This should be narrowed to specific ports, but I don't yet know which ones.)

[3] Copy the directory EMR_MASTER:/etc/hadoop/conf to MY_STANDALONE_SERVER:/home/zeppelin/hadoop-conf.

[4] zeppelin/conf/zeppelin-env.sh should contain:

export MASTER=yarn-client
export HADOOP_CONF_DIR=/home/zeppelin/hadoop-conf

Note: Spark parameters such as spark.executor.instances are taken from the Interpreter settings, so they should be specified there.
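Steps [3] and [4] can be sketched as a small shell helper. This is only a rough sketch under a few assumptions: the function names check_hadoop_conf and write_zeppelin_env are made up for illustration, and the check for core-site.xml and yarn-site.xml assumes those are the client-side files the copied directory needs to contain. The scp line in the comment assumes the default EMR login user "hadoop"; adjust it to your own setup.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for steps [3] and [4]; names and checks are illustrative.
#
# Step [3] itself is a plain copy run on the Zeppelin machine, for example
# (assuming the "hadoop" SSH user and that the target directory does not exist yet):
#   scp -r hadoop@EMR_MASTER:/etc/hadoop/conf /home/zeppelin/hadoop-conf

check_hadoop_conf() {
  # Succeeds only if the directory holds the client configs assumed to be needed.
  local dir="$1" f
  for f in core-site.xml yarn-site.xml; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing $dir/$f" >&2
      return 1
    fi
  done
}

write_zeppelin_env() {
  # Appends the two exports from step [4] to conf/zeppelin-env.sh,
  # skipping any export that is already present in the file.
  local zeppelin_home="$1" hadoop_conf="$2"
  local env_file="$zeppelin_home/conf/zeppelin-env.sh"
  mkdir -p "$zeppelin_home/conf"
  touch "$env_file"
  grep -q '^export MASTER=' "$env_file" \
    || echo 'export MASTER=yarn-client' >> "$env_file"
  grep -q '^export HADOOP_CONF_DIR=' "$env_file" \
    || echo "export HADOOP_CONF_DIR=$hadoop_conf" >> "$env_file"
}
```

With the paths used above, this would be called as check_hadoop_conf /home/zeppelin/hadoop-conf followed by write_zeppelin_env ~/zeppelin /home/zeppelin/hadoop-conf. Running the check first makes a failed or nested copy (for example hadoop-conf/conf/...) show up before Zeppelin is started.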

answered Oct 16 '22 05:10

snowindy

Tags:

apache-spark

hadoop-yarn

emr

apache-zeppelin
