Spark on yarn concept understanding

2 Answers

Adding to other answers.

Is it necessary that spark is installed on all the nodes in the yarn cluster?

No, it is not necessary if the Spark job is scheduled on YARN (in either client or cluster mode). Spark needs to be installed on many nodes only in standalone mode.

The figures below visualize the Spark application deployment modes.

Spark Standalone Cluster

[Figure: Spark standalone mode]

In cluster mode the driver runs on one of the Spark worker nodes, whereas in client mode it runs on the machine that launched the job.

[Figure: YARN cluster mode]

[Figure: YARN client mode]
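To make the difference between the two YARN deploy modes concrete, here is a minimal sketch that only prints the spark-submit command for each mode, along with where the driver ends up. The jar name is a placeholder, and the `--master yarn --deploy-mode ...` form assumes Spark 2.x style syntax (older releases used `yarn-client` / `yarn-cluster` as the master value).

```shell
#!/bin/sh
# Sketch only: APP_JAR is a hypothetical path, not taken from the answer.
MAIN_CLASS="org.apache.spark.examples.SparkPi"
APP_JAR="spark-examples.jar"

# Print the spark-submit command for a YARN deploy mode ("client" or "cluster"),
# preceded by a comment line saying where the driver runs in that mode.
yarn_submit_cmd() {
  case "$1" in
    client)  where="driver runs on the machine that launched the job" ;;
    cluster) where="driver runs inside a YARN container on the cluster" ;;
    *) echo "unknown deploy mode: $1" >&2; return 1 ;;
  esac
  echo "# $where"
  echo "spark-submit --master yarn --deploy-mode $1 --class $MAIN_CLASS $APP_JAR"
}

yarn_submit_cmd client
yarn_submit_cmd cluster
```

Nothing is submitted here; copy the printed command and adjust the class and jar to run it for real.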

This table offers a concise list of differences between these modes:

[Table image: differences among Standalone, YARN Cluster and YARN Client modes]

It says in the documentation "Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client-side) configuration files for the Hadoop cluster". Why does the client node have to install Hadoop when it is sending the job to the cluster?

A Hadoop installation is not mandatory, but the configuration files (not all of them) are. Such client machines can be called gateway nodes. There are two main reasons:

The configuration contained in the HADOOP_CONF_DIR directory will be distributed to the YARN cluster so that all containers used by the application use the same configuration.
In YARN mode the ResourceManager's address is picked up from the Hadoop configuration (typically yarn-site.xml). Thus, the --master parameter is simply yarn.
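As a rough illustration of the two points above, the following sketch checks that the client-side configuration directory is in place before a job is submitted. Which files to look for is an assumption on my part (core-site.xml and yarn-site.xml); your cluster may need others, such as hdfs-site.xml.

```shell
#!/bin/sh
# Return 0 if the given directory looks like a usable client-side Hadoop
# configuration directory. The file list is an assumption, not an official rule.
check_hadoop_conf_dir() {
  dir="$1"
  [ -n "$dir" ] || { echo "no configuration directory given" >&2; return 1; }
  for f in core-site.xml yarn-site.xml; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing $dir/$f" >&2
      return 1
    fi
  done
  return 0
}

# Spark reads HADOOP_CONF_DIR or YARN_CONF_DIR; check whichever is set.
if check_hadoop_conf_dir "${HADOOP_CONF_DIR:-${YARN_CONF_DIR:-}}"; then
  echo "Hadoop client configuration found"
else
  echo "set HADOOP_CONF_DIR or YARN_CONF_DIR before running spark-submit --master yarn"
fi
```

If the check fails, spark-submit has no way to learn the ResourceManager's address, which is why only the configuration files (and not a full Hadoop installation) are needed on the gateway node.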

Update: (2017-01-04)

Spark 2.0+ no longer requires a fat assembly jar for production deployment. source

answered Oct 06 '22 02:10

mrsrinivas

We are running Spark jobs on YARN (we use HDP 2.2).

We don't have Spark installed on the cluster. We only added the Spark assembly jar to HDFS.

For example, to run the Pi example:

./bin/spark-submit \
  --verbose \
  --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  --conf spark.yarn.jar=hdfs://master:8020/spark/spark-assembly-1.3.1-hadoop2.6.0.jar \
  --num-executors 2 \
  --driver-memory 512m \
  --executor-memory 512m \
  --executor-cores 4 \
  hdfs://master:8020/spark/spark-examples-1.3.1-hadoop2.6.0.jar 100

--conf spark.yarn.jar=hdfs://master:8020/spark/spark-assembly-1.3.1-hadoop2.6.0.jar: this setting tells YARN where to find the Spark assembly jar. If you omit it, spark-submit uploads the jar from the machine where you run it.
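The command above is written for Spark 1.3. On Spark 2.0 and later, where there is no assembly jar, a roughly equivalent submission might look like the sketch below. To my knowledge `spark.yarn.jar` was replaced by `spark.yarn.jars` and `spark.yarn.archive`, and `--master yarn --deploy-mode cluster` replaced `yarn-cluster`. Both HDFS paths are hypothetical, and the script only prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Hypothetical HDFS locations; adjust to your cluster.
SPARK_ARCHIVE="hdfs://master:8020/spark/spark-libs.zip"   # assumed archive of Spark's jars
EXAMPLES_JAR="hdfs://master:8020/spark/spark-examples.jar" # assumed examples jar

# Dry run by default: print the command instead of executing it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

run spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --conf "spark.yarn.archive=$SPARK_ARCHIVE" \
  --num-executors 2 \
  --driver-memory 512m \
  --executor-memory 512m \
  --executor-cores 4 \
  "$EXAMPLES_JAR" 100
```

As with spark.yarn.jar, pointing at a copy already on HDFS avoids uploading Spark's jars from the client on every submission.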

About your second question: the client node does not need Hadoop installed. It only needs the configuration files. You can copy the configuration directory from your cluster to your client.

answered Oct 06 '22 01:10

RanP


Tags:

apache-spark

hadoop

hadoop-yarn

hdfs
