 

What does container/resource allocation mean in Hadoop and in Spark when running on Yarn?

As Spark runs in-memory, what does resource allocation mean in Spark when running on YARN, and how does it contrast with Hadoop's container allocation? Just curious, since Hadoop's data and computations are on disk, whereas Spark's are in-memory.

asked May 03 '16 by spark_dream

People also ask

What is container in Hadoop YARN?

Container represents an allocated resource in the cluster. The ResourceManager is the sole authority to allocate any Container to applications. The allocated Container is always on a single node and has a unique ContainerId . It has a specific amount of Resource allocated.

How resources are allocated in YARN?

You can manage your cluster capacity using the Capacity Scheduler in YARN. You can use the Capacity Scheduler's DefaultResourceCalculator or the DominantResourceCalculator to allocate available resources. The fundamental unit of scheduling in YARN is the queue.

What is running containers in YARN?

A YARN container is a process space where a given task runs in isolation, using resources from the resource pool. The ResourceManager is the authority that assigns containers to applications. The assigned container has a unique ContainerId and is always on a single node.

Which of the following is true of running a Spark application on Hadoop YARN?

Which of the following is true about a Spark application running on Hadoop YARN? Client mode and cluster mode are the two deploy modes that can be used to launch Spark applications on YARN.


1 Answer

Hadoop is a framework for processing large data sets. It has two layers: a distributed file system layer called HDFS, and a distributed processing layer. In Hadoop 2.x, the processing layer was architected in a generic way so that it can also be used for non-MapReduce applications. Any processing needs system resources such as memory, network, disk and CPU. The term container was introduced in Hadoop 2.x; in Hadoop 1.x, the equivalent term was slot. A container is an allocation, or share, of memory and CPU. YARN is a general resource management framework which enables efficient utilization of the resources in the cluster nodes through proper allocation and sharing.
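The size of a container is bounded by cluster configuration. A minimal sketch of the relevant `yarn-site.xml` properties (the property names are the real YARN keys; the values are example numbers, not tuning recommendations):

```xml
<!-- yarn-site.xml -->
<configuration>
  <!-- Total resources each NodeManager offers to YARN -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>8192</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>8</value>
  </property>
  <!-- Smallest and largest container the scheduler will grant -->
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>8192</value>
  </property>
</configuration>
```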

In-memory processing means the data is loaded completely into memory and processed without writing intermediate data to disk. This is faster because the computation happens in memory with few disk I/O operations, but it needs more memory because the entire data set is held in memory.

Batch processing means the data is taken and processed in batches; intermediate results are written to disk and then supplied to the next stage. This also needs memory and CPU for processing, but less than fully in-memory processing systems.

YARN's ResourceManager acts as the central resource allocator for applications such as MapReduce, Impala (with Llama), Spark (in YARN mode), etc. When we trigger a job, it requests the ResourceManager for the resources required for execution, and the ResourceManager allocates them based on availability. The resources are allocated in the form of containers; a container is just an allocation of memory and CPU. One job may need multiple containers, and they will be allocated across the cluster depending on availability. The tasks are executed inside the containers.
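The placement idea above can be sketched in a few lines. This is a toy model, not YARN's actual scheduler (which also handles queues, locality and fairness): a job's request for N containers of a given size is satisfied on whichever nodes currently have free capacity.

```python
def allocate(nodes, num_containers, mem_mb, vcores):
    """Greedily place containers on nodes with enough free capacity.

    nodes maps a node name to its free [memory_mb, vcores]; the dict is
    mutated as containers are granted, mimicking capacity being used up.
    """
    placements = []
    for _ in range(num_containers):
        for name, free in nodes.items():
            if free[0] >= mem_mb and free[1] >= vcores:
                free[0] -= mem_mb
                free[1] -= vcores
                placements.append(name)
                break
        else:
            break  # no node can satisfy the request right now
    return placements

# A hypothetical three-node cluster and a job asking for
# 4 containers of 2 GB / 1 vcore each:
cluster = {"node1": [4096, 4], "node2": [2048, 2], "node3": [4096, 4]}
print(allocate(cluster, 4, 2048, 1))
```

Note how the four containers end up spread across nodes: once node1's free memory is exhausted, subsequent containers land on node2 and node3.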

For example, when we submit a MapReduce job, an MR ApplicationMaster is launched, and it negotiates with the ResourceManager for additional resources. Map and reduce tasks are then spawned in the allocated containers.

Similarly, when we submit a Spark job in YARN mode, a Spark ApplicationMaster is launched, and it negotiates with the ResourceManager for additional resources. Spark executors are then launched in the allocated containers, and the tasks that operate on the RDDs run inside those executors.
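A typical submission in YARN mode looks like the following, where `--executor-memory` and `--executor-cores` describe the containers the ApplicationMaster will request from the ResourceManager (the application file name is a placeholder):

```shell
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 2g \
  --executor-cores 2 \
  my_app.py
```

Note that the container YARN grants for each executor is somewhat larger than `--executor-memory`, since Spark adds an off-heap overhead (`spark.executor.memoryOverhead`) on top of the executor heap.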

answered Sep 28 '22 by Amal G Jose