I am trying to launch a Spark job on a Hadoop cluster using spark-submit in YARN mode.
I am launching spark-submit from my development machine.
According to the Running Spark on YARN docs, I am supposed to provide the path to the Hadoop cluster configuration via the env var HADOOP_CONF_DIR or YARN_CONF_DIR. This is where it gets tricky: why must these folders exist on my local machine if I am sending the task to a remote YARN service? Does this mean that spark-submit must be located inside the cluster, and therefore that I cannot launch a Spark task remotely? If not, what should I populate these folders with? Should I copy the Hadoop configuration folder from the YARN cluster node where the task manager service resides?
HADOOP_CONF_DIR & spark-env.sh: when running Spark on YARN, you need to add the following line to spark-env.sh. Note: check that $HADOOP_HOME/etc/hadoop is the correct path in your environment, and make sure spark-env.sh also exports HADOOP_HOME.
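A typical spark-env.sh entry for this, assuming Hadoop is installed under $HADOOP_HOME and keeps its client configs in the default etc/hadoop directory (adjust the paths if your layout differs):

export HADOOP_HOME=/opt/hadoop            # illustrative install path
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop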
In YARN mode, when accessing Hadoop file systems, Spark automatically obtains delegation tokens for the service hosting the staging directory of the Spark application, in addition to the default file system in the Hadoop configuration. (On secured clusters, spark.yarn.keytab takes the full path to the file that contains the keytab for the principal given in spark.yarn.principal.)
Unlike other cluster managers supported by Spark, where the master's address is specified in the --master parameter, in YARN mode the ResourceManager's address is picked up from the Hadoop configuration. Thus, the --master parameter is simply yarn. To launch a Spark application in cluster mode:
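A typical invocation looks like the following (the class, jar and resource settings here are the bundled SparkPi example and are purely illustrative; substitute your own application):

./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode cluster \
    --driver-memory 2g \
    --executor-memory 2g \
    --num-executors 3 \
    examples/jars/spark-examples*.jar \
    10

To run in client mode instead, replace --deploy-mode cluster with --deploy-mode client.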
Note that in client mode the Spark driver does not run on the YARN cluster; only the Spark executors do. The --files and --archives options support specifying file names with the # syntax, similar to Hadoop: for example, --files localtest.txt#appSees.txt uploads the local file localtest.txt but makes it visible to the application under the name appSees.txt.
1) When submitting a job, Spark needs to know what it is connecting to. The configuration files are parsed, and the required settings are used to connect to the Hadoop cluster. Note that the documentation calls this client-side configuration (right in the first sentence), meaning you do not need to mirror every cluster setting in these files. To connect to a non-secured Hadoop cluster with a minimalist configuration, you need at least the following configs present (a minimal file sketch follows the list):
- fs.defaultFS (in case you intend to read from HDFS)
- dfs.nameservices
- yarn.resourcemanager.hostname or yarn.resourcemanager.address
- yarn.application.classpath
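A minimal sketch of what those files could contain, assuming hypothetical hosts namenode-host and resourcemanager-host (hostnames and port are placeholders; use your cluster's values):

core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-host:8020</value>
  </property>
</configuration>

yarn-site.xml:
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>resourcemanager-host</value>
  </property>
</configuration>

dfs.nameservices would go in hdfs-site.xml (it is only needed for HA HDFS setups), and yarn.application.classpath also goes in yarn-site.xml; the easiest way to get correct values for these is to copy them from the cluster's own config files.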
You can avoid having these files altogether by setting the same properties in the code of the job you are submitting:
import org.apache.spark.SparkConf;

// spark.hadoop.* properties are forwarded to the underlying Hadoop Configuration
SparkConf sparkConfiguration = new SparkConf();
sparkConfiguration.set("spark.hadoop.fs.defaultFS", "...");
// ... set the remaining spark.hadoop.* properties the same way
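As a rough sketch, assuming a non-secured cluster and the hypothetical host names namenode-host and resourcemanager-host, the minimal properties listed above could be supplied like this:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class RemoteSubmitSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("remote-yarn-sketch")
                .setMaster("yarn")  // the ResourceManager address comes from the Hadoop settings below
                .set("spark.hadoop.fs.defaultFS", "hdfs://namenode-host:8020")
                .set("spark.hadoop.yarn.resourcemanager.hostname", "resourcemanager-host");

        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... your job ...
        sc.stop();
    }
}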
2) spark-submit can be located on any machine, not necessarily on the cluster, as long as it knows how to connect to the cluster (you can even run the submission from Eclipse without installing anything beyond the project dependencies related to Spark).
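For example, with Maven the Spark-related dependencies could look roughly like this (the Scala suffix and version are illustrative; match them to your cluster):

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.2.0</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-yarn_2.11</artifactId>
  <version>2.2.0</version>
</dependency>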
3) You should populate the configuration folders with the Hadoop client configuration files, typically core-site.xml, hdfs-site.xml and yarn-site.xml.
Copying those files from the server is the easiest approach to start with. Afterwards you can remove any configuration that is not required by spark-submit or that may be security-sensitive.
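A minimal sketch of that approach, assuming SSH access to a cluster node called cluster-node and a standard /etc/hadoop/conf layout (both are assumptions; adjust to your environment):

mkdir -p ~/hadoop-conf
scp cluster-node:/etc/hadoop/conf/{core-site.xml,hdfs-site.xml,yarn-site.xml} ~/hadoop-conf/
export HADOOP_CONF_DIR=~/hadoop-conf
./bin/spark-submit --master yarn --deploy-mode cluster --class your.main.Class your-app.jar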