
Spark submit YARN mode HADOOP_CONF_DIR contents

I am trying to launch a Spark task on a Hadoop cluster using spark-submit in YARN mode.

I am launching spark-submit from my development machine.

According to the Running Spark on YARN docs, I am supposed to provide a path to the Hadoop cluster configuration via the env var HADOOP_CONF_DIR or YARN_CONF_DIR. This is where it gets tricky: why must these folders exist on my local machine if I am sending the task to a remote YARN service? Does this mean that spark-submit must be located inside the cluster, and therefore I cannot launch a Spark task remotely? If not, what should I populate these folders with? Should I copy the Hadoop configuration folder from the YARN cluster node where the task manager service resides?
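For context, the invocation from my development machine looks roughly like this (the conf path, class and jar names below are just placeholders, not my actual job):

export HADOOP_CONF_DIR=/path/to/hadoop/conf   # <-- what should go in here?
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyJob \
  my-job.jar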

asked Jul 20 '16 by NotGaeL


1 Answer

1) When submitting a job, Spark needs to know what it is connecting to. The files are parsed and the required configuration is used to connect to the Hadoop cluster. Note that the documentation calls this client-side configuration (right in the first sentence), meaning you do not actually need the full cluster configuration in those files. To connect to a non-secured Hadoop cluster with a minimalist configuration, you will need at least the following settings present (a rough sketch of such a configuration follows the list):

  • fs.defaultFS (in case you intend to read from HDFS)
  • dfs.nameservices
  • yarn.resourcemanager.hostname or yarn.resourcemanager.address
  • yarn.application.classpath
  • (others might be required, depending on the configuration)
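As a rough illustration, a minimal client-side core-site.xml and yarn-site.xml might look something like this (host names and ports are placeholders, not values from the question):

core-site.xml:

  <configuration>
    <property>
      <!-- Placeholder NameNode address; replace with your cluster's -->
      <name>fs.defaultFS</name>
      <value>hdfs://namenode.example.com:8020</value>
    </property>
  </configuration>

yarn-site.xml:

  <configuration>
    <property>
      <!-- Placeholder ResourceManager host; replace with your cluster's -->
      <name>yarn.resourcemanager.hostname</name>
      <value>resourcemanager.example.com</value>
    </property>
  </configuration>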

You can avoid the files altogether by setting the same properties in the code of the job you are submitting:

// Properties with the "spark.hadoop." prefix are copied into the underlying Hadoop Configuration.
SparkConf sparkConfiguration = new SparkConf();
sparkConfiguration.set("spark.hadoop.fs.defaultFS", "...");
...

2) spark-submit can be located on any machine, not necessarily on the cluster, as long as it knows how to connect to the cluster (you can even run the submission from Eclipse without installing anything beyond the Spark-related project dependencies); a rough sketch of such a programmatic setup follows.
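This is only a sketch, assuming placeholder host names and that the Spark/YARN dependencies are already on the project classpath:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class RemoteYarnSubmitSketch {
    public static void main(String[] args) {
        // Placeholder addresses: replace with your own cluster's NameNode and ResourceManager.
        SparkConf conf = new SparkConf()
                .setAppName("remote-yarn-sketch")
                .setMaster("yarn")
                .set("spark.hadoop.fs.defaultFS", "hdfs://namenode.example.com:8020")
                .set("spark.hadoop.yarn.resourcemanager.hostname", "resourcemanager.example.com");

        // JavaSparkContext implements Closeable, so try-with-resources stops it for us.
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            long count = sc.parallelize(Arrays.asList(1, 2, 3)).count();
            System.out.println("count = " + count);
        }
    }
}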

3) You should populate the configuration folders with:

  • core-site.xml
  • yarn-site.xml
  • hdfs-site.xml
  • mapred-site.xml

Copying those files from the server is the easiest approach to start with. Afterwards you can remove any configuration that is not required by spark-submit or that may be security-sensitive.
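For example, something along these lines, assuming you can SSH to a cluster node and that its client configuration lives in the usual /etc/hadoop/conf (both are assumptions about your environment):

# Copy the client-side Hadoop/YARN configuration from a cluster node (paths are placeholders).
mkdir -p ~/hadoop-conf
scp user@cluster-node:/etc/hadoop/conf/{core-site.xml,hdfs-site.xml,yarn-site.xml,mapred-site.xml} ~/hadoop-conf/

# Point spark-submit at it.
export HADOOP_CONF_DIR=~/hadoop-conf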

answered Oct 13 '22 by Serhiy