
Spark submit YARN mode HADOOP_CONF_DIR contents

I am trying to launch a Spark task on a Hadoop cluster using spark-submit in YARN mode.

I am launching spark-submit from my development machine.

According to the Running Spark on YARN docs, I am supposed to provide a path to the Hadoop cluster configuration via the env var HADOOP_CONF_DIR or YARN_CONF_DIR. This is where it gets tricky: why must these folders exist on my local machine if I am sending the task to a remote YARN service? Does this mean that spark-submit must be located inside the cluster, and therefore I cannot launch a Spark task remotely? If not, what should I populate these folders with? Should I copy the Hadoop configuration folder from the YARN cluster node where the task manager service resides?
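For context, the invocation from my development machine looks roughly like this (the conf path, class and jar names below are just placeholders, not my actual job):

export HADOOP_CONF_DIR=/path/to/hadoop/conf   # <-- what should go in here?
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyJob \
  my-job.jar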

asked Jul 20 '16 by NotGaeL


1 Answer

1) When submitting a job, Spark needs to know what it is connecting to. The files are parsed and the required configuration is used to connect to the Hadoop cluster. Note that the documentation calls this client-side configuration (right in the first sentence), meaning you do not actually need the full cluster configuration in those files. To connect to a non-secured Hadoop cluster with a minimalist configuration, you will need at least the following settings present (a rough sketch of such a configuration follows the list):

  • fs.defaultFS (in case you intend to read from HDFS)
  • dfs.nameservices
  • yarn.resourcemanager.hostname or yarn.resourcemanager.address
  • yarn.application.classpath
  • (others might be required, depending on the configuration)
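As a rough illustration, a minimal client-side core-site.xml and yarn-site.xml might look something like this (host names and ports are placeholders, not values from the question):

core-site.xml:

  <configuration>
    <property>
      <!-- Placeholder NameNode address; replace with your cluster's -->
      <name>fs.defaultFS</name>
      <value>hdfs://namenode.example.com:8020</value>
    </property>
  </configuration>

yarn-site.xml:

  <configuration>
    <property>
      <!-- Placeholder ResourceManager host; replace with your cluster's -->
      <name>yarn.resourcemanager.hostname</name>
      <value>resourcemanager.example.com</value>
    </property>
  </configuration>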

You can avoid the files altogether by setting the same properties in the code of the job you are submitting:

// Properties with the "spark.hadoop." prefix are copied into the underlying Hadoop Configuration.
SparkConf sparkConfiguration = new SparkConf();
sparkConfiguration.set("spark.hadoop.fs.defaultFS", "...");
...

2) spark-submit can be located on any machine, not necessarily on the cluster, as long as it knows how to connect to the cluster (you can even run the submission from Eclipse without installing anything beyond the Spark-related project dependencies); a rough sketch of such a programmatic setup follows.
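This is only a sketch, assuming placeholder host names and that the Spark/YARN dependencies are already on the project classpath:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class RemoteYarnSubmitSketch {
    public static void main(String[] args) {
        // Placeholder addresses: replace with your own cluster's NameNode and ResourceManager.
        SparkConf conf = new SparkConf()
                .setAppName("remote-yarn-sketch")
                .setMaster("yarn")
                .set("spark.hadoop.fs.defaultFS", "hdfs://namenode.example.com:8020")
                .set("spark.hadoop.yarn.resourcemanager.hostname", "resourcemanager.example.com");

        // JavaSparkContext implements Closeable, so try-with-resources stops it for us.
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            long count = sc.parallelize(Arrays.asList(1, 2, 3)).count();
            System.out.println("count = " + count);
        }
    }
}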

3) You should populate the configuration folders with:

  • core-site.xml
  • yarn-site.xml
  • hdfs-site.xml
  • mapred-site.xml

Copying those files from the server is the easiest approach to start with. Afterwards you can remove any configuration that is not required by spark-submit or that may be security-sensitive.
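For example, something along these lines, assuming you can SSH to a cluster node and that its client configuration lives in the usual /etc/hadoop/conf (both are assumptions about your environment):

# Copy the client-side Hadoop/YARN configuration from a cluster node (paths are placeholders).
mkdir -p ~/hadoop-conf
scp user@cluster-node:/etc/hadoop/conf/{core-site.xml,hdfs-site.xml,yarn-site.xml,mapred-site.xml} ~/hadoop-conf/

# Point spark-submit at it.
export HADOOP_CONF_DIR=~/hadoop-conf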

answered Oct 13 '22 by Serhiy