
How does Hadoop perform input splits?

This is a conceptual question involving Hadoop/HDFS. Let's say you have a file containing 1 billion lines, and for the sake of simplicity, each line is of the form <k,v>, where k is the offset of the line from the beginning of the file and v is the content of the line.

Now, when we say that we want to run N map tasks, does the framework split the input file into N splits and run each map task on its split? Or do we have to write a partitioning function that produces the N splits and then run each map task on the split it generates?

All I want to know is whether the splits are done internally or whether we have to split the data manually.

More specifically, each time the map() function is called, what exactly are its key and value parameters?

Thanks, Deepak

asked May 14 '10 by Deepak


People also ask

What are input splits in Hadoop?

InputSplit represents the data to be processed by an individual Mapper. Typically, it presents a byte-oriented view of the input, and it is the responsibility of the job's RecordReader to process this and present a record-oriented view. See also: InputFormat, RecordReader.
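For reference, those two responsibilities are visible in the (Hadoop 2.x mapreduce API) InputFormat contract itself; a minimal sketch of the abstraction, not tied to any particular format:

```java
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// An InputFormat has two jobs:
//   1. getSplits()          -> chop the input into logical InputSplits (one per map task)
//   2. createRecordReader() -> turn the bytes of one split into <key, value> records
public abstract class SketchInputFormat<K, V> extends InputFormat<K, V> {

    @Override
    public abstract List<InputSplit> getSplits(JobContext context)
            throws IOException, InterruptedException;

    @Override
    public abstract RecordReader<K, V> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException, InterruptedException;
}
```

Concrete formats such as TextInputFormat fill in both methods, which is why a normal job never has to compute splits by hand.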

How does input split work?

InputSplit is the logical representation of data in Hadoop MapReduce. It represents the data that an individual mapper processes, so the number of map tasks equals the number of InputSplits. The framework divides each split into records, which the mapper processes.

How does Hadoop split large files?

To solve this problem, Hadoop uses a logical representation of the data stored in file blocks, known as input splits. When a MapReduce job is submitted by the client, the framework calculates the total number of input splits; it works out where the first record in a block starts and where the last record in the block finishes.

How is input split size calculated in Hadoop?

Suppose there is 1 GB (1024 MB) of data that needs to be stored and processed by Hadoop. While storing that 1 GB in HDFS, Hadoop will split the data into smaller chunks. With the default block/split size of 128 MB, Hadoop will store the 1 GB of data in 8 blocks (1024 / 128 = 8).
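As a sketch of that arithmetic: FileInputFormat derives the split size from the block size, clamped by the configured minimum and maximum split sizes, roughly max(minSize, min(maxSize, blockSize)). The example below assumes the default 128 MB block size and default min/max split settings:

```java
// Sketch of how FileInputFormat chooses the split size (Hadoop 2.x):
//   splitSize = max(minSize, min(maxSize, blockSize))
// minSize/maxSize come from mapreduce.input.fileinputformat.split.minsize / .maxsize.
public class SplitSizeDemo {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;   // default HDFS block size: 128 MB
        long minSize = 1L;                     // default minimum split size
        long maxSize = Long.MAX_VALUE;         // default maximum split size
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);

        long fileSize = 1024L * 1024 * 1024;   // the 1 GB file from the example
        long numSplits = (fileSize + splitSize - 1) / splitSize;
        System.out.println("split size = " + splitSize + " bytes, splits = " + numSplits); // 8
    }
}
```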


3 Answers

The InputFormat is responsible for providing the splits.

In general, if you have n nodes, HDFS will distribute the file over all of those nodes. If you start a job, there will by default be one mapper per input split, and Hadoop arranges for each mapper to process the part of the data stored on the node it runs on. This is called data locality (rack awareness is the related mechanism HDFS uses to place replicas across racks).

So, to make a long story short: upload the data into HDFS and start an MR job. Hadoop takes care of the optimised execution.
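To make that concrete, and to answer the key/value part of the question: with the default TextInputFormat, you only point the job at an input path; the framework computes the splits itself, and each map() call receives the line's byte offset as the key and the line's text as the value. A minimal sketch (class names and paths are placeholders, not from the original post):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SplitDemo {

    // key   = byte offset of the line from the start of the file (Deepak's "k")
    // value = the content of that line (Deepak's "v")
    public static class LineMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(value, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-demo");
        job.setJarByClass(SplitDemo.class);
        job.setMapperClass(LineMapper.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // No manual partitioning or splitting: TextInputFormat computes the splits.
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```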

answered Sep 18 '22 by Peter Wippermann


Files are split into HDFS blocks and the blocks are replicated. Hadoop assigns each split to a node based on the data locality principle: it will try to execute the mapper on a node where the block resides. Because of replication, multiple nodes host the same block.

In case those nodes are not available, Hadoop will try to pick a node that is closest to the node hosting the data block; it could pick another node in the same rack, for example. A node may be unavailable for various reasons: all of its map slots may be in use, or the node may simply be down.
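If you want to see how well the scheduler achieved locality for a finished job, the built-in job counters record it; a small sketch, assuming you still hold the Job handle after waitForCompletion():

```java
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobCounter;

// After job.waitForCompletion(true), the framework's counters show how many
// map tasks ran on a node holding their data versus elsewhere in the same rack.
public class LocalityReport {
    public static void print(Job job) throws Exception {
        Counters counters = job.getCounters();
        System.out.println("data-local maps: "
                + counters.findCounter(JobCounter.DATA_LOCAL_MAPS).getValue());
        System.out.println("rack-local maps: "
                + counters.findCounter(JobCounter.RACK_LOCAL_MAPS).getValue());
    }
}
```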

answered Sep 20 '22 by Pranab


Fortunately, everything is taken care of by the framework.

MapReduce data processing is driven by this concept of input splits. The number of input splits that are calculated for a specific application determines the number of mapper tasks.

The number of maps is usually driven by the number of DFS blocks in the input files.

Each of these mapper tasks is assigned, where possible, to a slave node where the input split is stored. The Resource Manager (or JobTracker, if you’re in Hadoop 1) does its best to ensure that input splits are processed locally.

If data locality can't be achieved because an input split crosses the boundaries of data nodes, some data will be transferred from one data node to another.

Assume the block size is 128 MB and the last record does not fit in block A but spills over into block B; the data in block B belonging to that record will then be copied to the node holding block A.
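That copying is driven by the record reader rather than the mapper: the stock line reader finishes a record even if it runs past the split's end, and every split except the first discards its first (possibly partial) line, so each line is read exactly once. A simplified, self-contained sketch of that rule (an illustration, not the actual LineRecordReader code):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Simplified illustration of how a line-oriented record reader handles a split
// [start, end]: skip the first (possibly partial) line unless this is the very
// first split, then keep reading whole lines until the position passes 'end'.
// The last line of a split may therefore pull a few bytes from the next block.
public class SplitLineReader {
    public static void readSplit(String path, long start, long end) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            file.seek(start);
            long pos = start;
            if (start != 0) {
                // Discard the partial first line; the previous split reads it in full.
                file.readLine();
                pos = file.getFilePointer();
            }
            while (pos <= end) {               // <= so the boundary line belongs to this split
                String line = file.readLine(); // may read past 'end' to finish the record
                if (line == null) break;       // end of file
                pos = file.getFilePointer();
                System.out.println(line);      // hand <offset, line> to the mapper here
            }
        }
    }
}
```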


Have a look at these related questions:

About Hadoop/HDFS file splitting

How does Hadoop process records split across block boundaries?

answered Sep 20 '22 by Ravindra babu