I have set up a YARN cluster using Ambari, with 3 VMs as hosts.
Where can I find the value for HADOOP_CONF_DIR?
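# Run on a YARN cluster.
# --master can also be `yarn-client` for client mode; on Spark 2.0 and later,
# use `--master yarn --deploy-mode cluster` (or `client`) instead.
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn-cluster \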
--executor-memory 20G \
--num-executors 50 \
/path/to/examples.jar \
1000
Also, on a CDH cluster, HADOOP_CONF_DIR should be set to /etc/hadoop/conf by default.
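If you are not sure which directory applies on your hosts, one way to check is to look for the client configuration files themselves. The sketch below is only an illustration: the helper name find_hadoop_conf_dir and the candidate list are assumptions, and it simply treats a directory as valid when it contains both core-site.xml and yarn-site.xml.

```shell
# Hypothetical helper: print the first candidate directory that contains
# both core-site.xml and yarn-site.xml; return non-zero if none does.
find_hadoop_conf_dir() {
  for dir in "$@"; do
    if [ -f "$dir/core-site.xml" ] && [ -f "$dir/yarn-site.xml" ]; then
      printf '%s\n' "$dir"
      return 0
    fi
  done
  return 1
}

# The candidate list is an assumption; adjust it for your distribution.
if conf_dir=$(find_hadoop_conf_dir /etc/hadoop/conf "${HADOOP_HOME:-/usr/local/hadoop}/etc/hadoop"); then
  export HADOOP_CONF_DIR="$conf_dir"
  echo "HADOOP_CONF_DIR=$HADOOP_CONF_DIR"
else
  echo "No Hadoop client configuration found in the candidate directories" >&2
fi
```

Run it on a host where the Hadoop client is installed; if nothing is found, the configuration probably lives somewhere else on that host.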
You can submit a Spark batch application in cluster mode or client mode, either from inside the cluster or from an external client. In cluster mode, the driver runs on a host in your driver resource group; the spark-submit syntax is --deploy-mode cluster. This passage treats cluster mode as the default, but stock Apache Spark's spark-submit defaults to client mode, so pass the flag explicitly.
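To make the two modes concrete, here is a small sketch that only prints the spark-submit command for a given deploy mode instead of running it. The helper name submit_cmd is made up for illustration; the class name and jar path are the placeholders from the example above, and --master yarn with --deploy-mode is the form used by Spark 2.0 and later.

```shell
# Hypothetical helper: print (rather than run) a spark-submit command
# for the requested deploy mode, rejecting anything else.
submit_cmd() {
  mode=$1
  case "$mode" in
    cluster|client) ;;
    *) echo "deploy mode must be 'cluster' or 'client'" >&2; return 1 ;;
  esac
  echo "./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode $mode /path/to/examples.jar 1000"
}

submit_cmd cluster   # driver runs inside the cluster
submit_cmd client    # driver runs on the machine that submits the job
```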
If you use YARN as the cluster manager on a multi-node cluster, you do not need to install Spark on every node: YARN distributes the Spark binaries to the nodes when a job is submitted. Running Spark on YARN does require a binary distribution of Spark that is built with YARN support.
Install Hadoop as well. In my case, I installed it in /usr/local/hadoop.
Set up the Hadoop environment variables:
export HADOOP_INSTALL=/usr/local/hadoop
Then set the conf directory:
export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
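An export typed into a shell only lasts for that session. One common way to keep the setting is to put it in Spark's conf/spark-env.sh. The sketch below assumes SPARK_HOME points at your Spark install; append_once is a hypothetical helper that adds the line only if it is not already there.

```shell
# Hypothetical helper: append a line to a file unless that exact line
# is already present, so repeated runs do not duplicate it.
append_once() {
  line=$1
  file=$2
  touch "$file"
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Only touch spark-env.sh when SPARK_HOME is set and has a conf directory.
# The Hadoop path matches the /usr/local/hadoop install used above.
if [ -n "${SPARK_HOME:-}" ] && [ -d "$SPARK_HOME/conf" ]; then
  append_once 'export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop' "$SPARK_HOME/conf/spark-env.sh"
fi
```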