How does YARN know about data locality in Apache Spark in cluster mode?

Tags:

apache-spark

hadoop-yarn

Assume there is a Spark job that reads a file named records.txt from HDFS, applies some transformations, and runs one action (writing the processed output back to HDFS). The job is submitted to YARN in cluster mode.

Assume also that records.txt is a 128 MB file and that one of its replicated HDFS blocks is on NODE 1.

Let's say YARN allocates an executor on NODE 1.

How does YARN allocate an executor on exactly the node where the input data is located?

Who tells YARN that one of the replicated HDFS blocks of records.txt is available on NODE 1?

How does the Spark application determine data locality? Is it done by the driver, which runs inside the Application Master?

Does YARN know about data locality?

asked Apr 20 '18 14:04

Surender Raja

1 Answer

The fundamental question here is:

Does YARN know about data locality?

YARN "knows" what the application tells it, and it understands the structure (topology) of the cluster. When an application makes a resource request, it can include specific locality constraints, which may or may not be satisfied when resources are allocated.

If the constraints cannot be satisfied, YARN (or any other cluster manager) will attempt to provide the best alternative match, based on its knowledge of the cluster topology.
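To make that fallback concrete, here is a minimal toy model in Python of how a locality preference might be relaxed. It is not YARN code: `LocalityRequest`, `allocate`, the node and rack names, and the level labels are all hypothetical, and a real scheduler also takes queues, capacity, and timing into account before giving up on a preferred node.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class LocalityRequest:
    """Hypothetical, simplified stand-in for a locality-aware resource request."""
    nodes: List[str]  # hosts the application would like best
    racks: List[str]  # racks that are an acceptable second choice


def allocate(
    request: LocalityRequest,
    free_nodes: List[str],
    topology: Dict[str, str],
) -> Optional[Tuple[str, str]]:
    """Pick a node for the request, relaxing locality step by step.

    free_nodes: nodes that currently have spare capacity (assumed input).
    topology:   node name -> rack name, i.e. what the cluster manager knows.
    Returns (node, locality_level), or None if nothing is free.
    """
    # 1. Best case: one of the requested nodes has capacity.
    for node in request.nodes:
        if node in free_nodes:
            return node, "node-local"

    # 2. Next best: any free node on one of the requested racks.
    for node in free_nodes:
        if topology.get(node) in request.racks:
            return node, "rack-local"

    # 3. Last resort: any free node in the cluster.
    if free_nodes:
        return free_nodes[0], "any"
    return None


if __name__ == "__main__":
    topology = {"node1": "rackA", "node2": "rackA", "node3": "rackB"}
    request = LocalityRequest(nodes=["node1"], racks=["rackA"])
    print(allocate(request, ["node1", "node3"], topology))
    print(allocate(request, ["node3", "node2"], topology))
    print(allocate(request, ["node3"], topology))
```

The point of the sketch is only the ordering: the application states its preference, and the cluster manager uses its own topology map to pick the closest thing that is actually available.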

So how does the application "know"?

If the application uses an input source (a file system or other) that supports some form of data locality, it can query the corresponding catalog (the NameNode in the case of HDFS) to get the locations of the blocks of data it wants to access.

More broadly, a Spark RDD can define preferredLocations, depending on the specific RDD implementation, which can later be translated into resource constraints for the cluster manager (not necessarily YARN).
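The flow from block locations to a list of hosts worth requesting can be sketched roughly as below. This is a plain Python toy model, not the Spark or HDFS API. `BLOCK_HOSTS`, `preferred_locations`, and `hosts_to_request` are hypothetical names, and the replica hosts other than node1 are made up. It also assumes records.txt fits in a single block (a 128 MB block size) with three replicas, neither of which the question states.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical catalog answer for records.txt: block index -> hosts holding a replica.
# Assumes one 128 MB block with three replicas; only node1 comes from the question.
BLOCK_HOSTS: Dict[int, List[str]] = {0: ["node1", "node4", "node7"]}


def preferred_locations(partition: int, block_hosts: Dict[int, List[str]]) -> List[str]:
    """Hosts where the data for one partition lives (empty if unknown).

    Plays the role that preferredLocations plays for an RDD backed by a
    file system: one partition per block, preferences taken from the catalog.
    """
    return list(block_hosts.get(partition, []))


def hosts_to_request(block_hosts: Dict[int, List[str]]) -> List[str]:
    """Aggregate per-partition preferences into a ranked host list.

    Hosts holding replicas for more partitions come first; this is the kind
    of hint an application could pass along in its resource requests.
    """
    counts = Counter(host for hosts in block_hosts.values() for host in hosts)
    return [host for host, _ in counts.most_common()]


if __name__ == "__main__":
    print(preferred_locations(0, BLOCK_HOSTS))
    print(hosts_to_request(BLOCK_HOSTS))
```

In this picture the cluster manager never inspects HDFS itself. It only sees the host names the application chose to put in its request.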

answered Oct 01 '22 12:10

Alper t. Turker
