How to set up Spark with Zookeeper for HA?

I want to configure the Apache Spark master to connect to ZooKeeper.

I have installed both and started ZooKeeper.

In spark-env.sh, I added two lines:

-Dspark.deploy.recoveryMode=ZOOKEEPER

-Dspark.deploy.zookeeper.url=localhost:2181

But when I start Spark with ./sbin/start-all.sh, it shows these errors:

/home/deploy/spark-1.0.0/sbin/../conf/spark-env.sh: line 46: -Dspark.deploy.recoveryMode=ZOOKEEPER: command not found

/home/deploy/spark-1.0.0/sbin/../conf/spark-env.sh: line 47: -Dspark.deploy.zookeeper.url=localhost:2181: command not found

I want to know how to add the ZooKeeper settings in spark-env.sh.

asked Jun 12 '14 by Minh Ha Pham



2 Answers

Most probably you added these lines directly to the file, like so:

export SPARK_PREFIX=`dirname "$this"`/..
export SPARK_CONF_DIR="$SPARK_HOME/conf"
...
-Dspark.deploy.recoveryMode=ZOOKEEPER
-Dspark.deploy.zookeeper.url=localhost:2181

And when it is invoked by start-all.sh, bash complains that those -Dspark... lines are not valid commands. Note that spark-env.sh is a bash script and must contain valid bash expressions.

Following the configuration guide at High Availability, you should set SPARK_DAEMON_JAVA_OPTS with the options spark.deploy.recoveryMode, spark.deploy.zookeeper.url, and (optionally) spark.deploy.zookeeper.dir.

Using your data, you need to add a line to spark-env.sh like so:

export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=localhost:2181"
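With recovery configured on every master node, standalone HA works by pointing workers and applications at all masters at once; the leader is elected through ZooKeeper. A sketch of the commands, assuming two hypothetical master hosts master1 and master2 (the comma-separated master URL is standard Spark standalone syntax):

```
# Run on each master node, after setting SPARK_DAEMON_JAVA_OPTS in spark-env.sh:
./sbin/start-master.sh

# Applications list every master, so they can fail over when ZooKeeper
# elects a new leader:
./bin/spark-shell --master spark://master1:7077,master2:7077
```

Only one master is active at a time; the others stand by until ZooKeeper promotes them.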
answered Oct 03 '22 by maasg


Try adding the line below in spark-env.sh:

export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=ZK1:2181,ZK2:2181,ZK3:2181 -Dspark.deploy.zookeeper.dir=/sparkha"

Replace ZK1, ZK2, and ZK3 with your ZooKeeper quorum hosts and ports. Here /sparkha is the data store in ZooKeeper for Spark; by default it is /spark. Just tested, it worked for us. HTH
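To confirm the configuration took effect, you can check that Spark created the znode after the masters start. A sketch using the ZooKeeper CLI shipped with ZooKeeper (paths follow the spark.deploy.zookeeper.dir value above; ZK1 is a placeholder host):

```
# From the ZooKeeper installation directory:
bin/zkCli.sh -server ZK1:2181

# Inside the CLI, the recovery directory should exist:
ls /sparkha
```

If the znode is missing, check the master logs for ZooKeeper connection errors.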

answered Oct 02 '22 by Suman Banerjee