How does YARN know about data locality in Apache Spark in cluster mode?

Assume there is a Spark job that reads a file named records.txt from HDFS, applies some transformations, and runs one action (writing the processed output back to HDFS). The job is submitted in YARN cluster mode.

Assume also that records.txt is 128 MB and that one of its replicated HDFS blocks is on NODE 1.

Let's say YARN allocates an executor on NODE 1.

How does YARN allocate an executor on exactly the node where the input data is located?

Who tells YARN that one of the replicated HDFS blocks of records.txt is available on NODE 1?

How does the Spark application determine data locality? Is it done by the driver, which runs inside the Application Master?

Does YARN know about data locality?

Surender Raja asked Apr 20 '18 14:04




1 Answer

The fundamental question here is:

Does YARN know about data locality?

YARN "knows" what the application tells it, and it understands the structure (topology) of the cluster. When an application makes a resource request, it can include specific locality constraints, which may or may not be satisfied when resources are allocated.

If the constraints cannot be satisfied, YARN (or any other cluster manager) will attempt to provide the best alternative match, based on its knowledge of the cluster topology.
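To make the fallback idea concrete, here is a toy sketch in plain Python. It is not YARN code: `ResourceRequest`, `allocate`, the `rack_of` map, and the level labels are all hypothetical names made up for illustration. The only thing borrowed from YARN is the idea that a request can name preferred nodes and say whether locality may be relaxed.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class ResourceRequest:
    """Hypothetical request: where the application would like its container."""
    preferred_nodes: List[str]
    relax_locality: bool = True  # may the manager fall back to other nodes?


def allocate(
    request: ResourceRequest,
    free_nodes: Set[str],
    rack_of: Dict[str, str],
) -> Optional[Tuple[str, str]]:
    """Return (node, locality_level), or None if nothing acceptable is free."""
    # 1. Try to honour the request exactly: a free node the application named.
    for node in request.preferred_nodes:
        if node in free_nodes:
            return node, "node-local"

    # A strict request is simply not satisfied right now.
    if not request.relax_locality:
        return None

    # 2. Next best: any free node on the same rack as a preferred node.
    preferred_racks = {rack_of[n] for n in request.preferred_nodes if n in rack_of}
    for node in sorted(free_nodes):
        if rack_of.get(node) in preferred_racks:
            return node, "rack-local"

    # 3. Last resort: any free node anywhere in the cluster.
    for node in sorted(free_nodes):
        return node, "off-rack"
    return None
```

With `rack_of = {"node1": "rackA", "node2": "rackA", "node3": "rackB"}` and a request preferring `node1`, the sketch picks `node1` if it is free, then `node2` (same rack), then `node3`. The point is only that the manager needs two inputs: the preference from the application and its own topology map.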

So how does the application "know"?

If the application uses an input source (a file system or other) that supports some form of data locality, it can query the corresponding catalog (the NameNode in the case of HDFS) to get the locations of the data blocks it wants to access.
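As a rough illustration of what such a lookup gives back, the sketch below pairs each block-sized chunk of a file with the hosts that hold a replica of it. It is plain Python with no Hadoop dependency: `Split`, `plan_splits`, and the host names are hypothetical, and the replica list is passed in by hand where a real client would get it from the NameNode (in Hadoop, `FileSystem.getFileBlockLocations` is the call that returns this kind of information).

```python
from typing import List, NamedTuple

MB = 1024 * 1024


class Split(NamedTuple):
    """One chunk of input plus the hosts that store a replica of it."""
    offset: int
    length: int
    hosts: List[str]


def plan_splits(
    file_size: int,
    block_size: int,
    replica_hosts: List[List[str]],
) -> List[Split]:
    """Cut a file into block-sized splits and attach each block's hosts.

    replica_hosts[i] stands in for the catalog's answer for block i.
    """
    splits = []
    offset = 0
    block_index = 0
    while offset < file_size:
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        splits.append(Split(offset, length, replica_hosts[block_index]))
        offset += length
        block_index += 1
    return splits
```

For the 128 MB records.txt from the question, assuming a 128 MB block size, `plan_splits(128 * MB, 128 * MB, [["node1", "node4", "node7"]])` yields a single split whose hosts include node1. That host list is what the application can later use as its locality preference.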

More broadly, a Spark RDD can define preferredLocations, depending on the specific RDD implementation, and these can later be translated into resource constraints for the cluster manager (not necessarily YARN).
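One simple way to picture that translation step: collect the preferred hosts of every partition, count how many partitions would like each host, and ask for executors on the most requested hosts first. The sketch below is plain Python and is not Spark's actual placement logic; `host_preferences`, `pick_hosts`, and the sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def host_preferences(preferred_locations: Dict[int, List[str]]) -> Counter:
    """Count, per host, how many partitions list it as a preferred location."""
    counts: Counter = Counter()
    for hosts in preferred_locations.values():
        # set() so a host repeated within one partition is counted once.
        counts.update(set(hosts))
    return counts


def pick_hosts(preferred_locations: Dict[int, List[str]], num_executors: int) -> List[str]:
    """Pick the hosts to name in resource requests, most wanted first."""
    counts = host_preferences(preferred_locations)
    # Sort by descending count, then by name to keep the result deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [host for host, _ in ranked[:num_executors]]
```

The input here plays the role of what each partition reports as its preferred locations (partition index mapped to host names). The output is the list of hosts the application would pass along as a locality hint; whether the cluster manager can honour it is a separate matter.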

Alper t. Turker answered Oct 01 '22 12:10