
Can Hadoop tasks run in parallel on a single node?

I am new to Hadoop and have the following questions.

This is what I have understood about Hadoop:

1) Whenever a file is written to Hadoop, it is stored across the data nodes in chunks (64MB by default).

2) When we run an MR job, a split will be created from each block, and the split will be processed on each data node.

3) From each split, a record reader is used to generate key/value pairs on the mapper side.

Questions:

1) Can one data node process more than one split at a time? What if the data node has more capacity?

I think this was a limitation in MR1, and with MR2/YARN we have better resource utilization.

2) Is a split read in a serial fashion at the data node, or can it be processed in parallel to generate key/value pairs (by randomly accessing disk locations within the split)?

3) What does the 'slot' terminology mean in the map/reduce architecture? I was reading through one of the blogs, and it says YARN will provide better slot utilization on the DataNode.

asked Feb 13 '23 at 15:02 by user1927808


1 Answer

Let me first address the "what I have understood about Hadoop" part.

  1. A file stored on the Hadoop file system is NOT stored across all data nodes. Yes, it is split into chunks (64MB by default), but the number of DataNodes on which these chunks are stored depends on (a) the file size, (b) the current load on the DataNodes, (c) the replication factor, and (d) physical proximity. The NameNode takes these factors into account when deciding which DataNodes will store the chunks of a file. You can check the actual placement yourself, as the sketch after this list illustrates.

  2. Again, each DataNode MAY NOT process a split. Firstly, DataNodes are only responsible for managing the storage of data, not for executing jobs/tasks; the TaskTracker is the slave daemon responsible for executing tasks on individual nodes. Secondly, only those nodes which contain the data required for a particular job will process its splits, unless the load on those nodes is too high, in which case the data in the split is copied to another node and processed there.
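
To see the block placement in practice, here is a minimal sketch (not from the original answer) that uses the HDFS Java API to list which DataNodes hold each chunk of a file. The path /user/demo/input.txt is a hypothetical example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocations {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            Path file = new Path("/user/demo/input.txt"); // hypothetical path
            FileStatus status = fs.getFileStatus(file);

            // One BlockLocation per chunk; getHosts() lists the DataNodes
            // holding that chunk's replicas.
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        block.getOffset(), block.getLength(),
                        String.join(",", block.getHosts()));
            }
        }
    }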

Now, coming to the questions:

  1. Again, DataNodes are not responsible for processing jobs/tasks. We usually refer to the combination of DataNode + TaskTracker as a node, since they are commonly found on the same machine, handling different responsibilities (data storage and running tasks). A given node can process more than one split at a time. Usually a single split is assigned to a single map task, so this translates to multiple map tasks running on a single node, which is possible (see the split-size sketch after this list).

  2. Data from an input split is read in a serial fashion by the map task processing it.

  3. A node's processing capacity is defined by its number of slots. If a node has 10 slots, it can run 10 tasks in parallel (these may be map or reduce tasks). The cluster administrator usually configures the number of slots per node based on the physical configuration of that node, such as memory, physical storage, and number of processor cores; the slot-configuration sketch after this list shows the relevant knobs.
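
Returning to question 1: the number of map tasks follows from the number of input splits. Below is a hedged sketch (illustrative values, not from the original answer) of the split-size rule that Hadoop's FileInputFormat applies, showing how a single 200MB file yields several map tasks that the scheduler can run in parallel, possibly several per node.

    public class SplitSizeDemo {
        // Mirrors FileInputFormat's rule:
        // splitSize = max(minSize, min(maxSize, blockSize))
        static long computeSplitSize(long blockSize, long minSize, long maxSize) {
            return Math.max(minSize, Math.min(maxSize, blockSize));
        }

        public static void main(String[] args) {
            long blockSize = 64L * 1024 * 1024;  // 64MB default block size
            long fileSize  = 200L * 1024 * 1024; // hypothetical 200MB input

            long splitSize = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
            long numSplits = (fileSize + splitSize - 1) / splitSize;

            // 200MB / 64MB -> 4 splits -> 4 map tasks, which may be
            // scheduled across the cluster, several per node in parallel.
            System.out.println("splits (= map tasks): " + numSplits);
        }
    }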
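
And for question 3: in MR1 the slot counts are set per node by the cluster administrator in mapred-site.xml, through the properties shown below. This sketch only reads them back for illustration; it assumes it runs where mapred-site.xml is on the classpath, and the defaults of 2 are Hadoop's shipped values.

    import org.apache.hadoop.conf.Configuration;

    public class SlotConfig {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            conf.addResource("mapred-site.xml"); // assumes the file is on the classpath

            // MR1 TaskTracker slot settings (Hadoop's shipped default is 2 each).
            int mapSlots    = conf.getInt("mapred.tasktracker.map.tasks.maximum", 2);
            int reduceSlots = conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2);
            System.out.println("map slots per node:    " + mapSlots);
            System.out.println("reduce slots per node: " + reduceSlots);
        }
    }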

answered Feb 20 '23 at 18:02 by Chaos