 

How the data is split in Hadoop

Does Hadoop split the data based on the number of mappers set in the program? That is, given a 500 MB data set and 200 mappers (assuming the Hadoop cluster allows 200 mappers to run simultaneously), is each mapper given 2.5 MB of data?

Also, do all the mappers run simultaneously, or might some of them run serially?

HHH asked Jul 03 '13 22:07





1 Answer

I just ran a sample MapReduce program based on your question, and here are my findings.

Input: a file smaller than the block size.

Case 1: Number of mappers = 1. Result: 1 map task launched. The input split size for each mapper (in this case, only one) is the same as the input file size.

Case 2: Number of mappers = 5. Result: 5 map tasks launched. The input split size for each mapper is one fifth of the input file size.

Case 3: Number of mappers = 10. Result: 10 map tasks launched. The input split size for each mapper is one tenth of the input file size.

So, based on the above, for a file smaller than the block size:

split size = total input file size / number of map tasks launched

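Note: Keep in mind that the number of map tasks is determined by the number of input splits. The number of mappers you set is only a hint that is used when those splits are computed.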
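
To make the arithmetic concrete, here is a rough Python sketch of how I understand the split size calculation in the old `mapred` API's `FileInputFormat` (the newer `mapreduce` API does not take the requested mapper count into account in the same way). This is a simplified model for a single file, not Hadoop's actual code: the function names `compute_split_size` and `split_lengths`, the 64 MB block size, the minimum split size of 1 byte, and the 1.1 slop factor are my assumptions for illustration.

```python
MB = 1024 * 1024
SPLIT_SLOP = 1.1  # assumed: the last split may be up to 10% larger than the others


def compute_split_size(goal_size, min_size, block_size):
    """Split size is the goal size, capped by the block size and floored by the minimum."""
    return max(min_size, min(goal_size, block_size))


def split_lengths(total_size, requested_maps, block_size=64 * MB, min_size=1):
    """Return the byte length of each input split for a single file (simplified model)."""
    # The requested mapper count only sets a goal; it is treated as a hint.
    goal_size = total_size // max(requested_maps, 1)
    split_size = compute_split_size(goal_size, min_size, block_size)

    lengths = []
    remaining = total_size
    # Cut full-size splits while clearly more than one split's worth remains.
    while remaining / split_size > SPLIT_SLOP:
        lengths.append(split_size)
        remaining -= split_size
    # Whatever is left over becomes the final split.
    if remaining != 0:
        lengths.append(remaining)
    return lengths


if __name__ == "__main__":
    # The three cases above: a 10 MB file (smaller than the block size).
    for maps in (1, 5, 10):
        print(maps, "requested ->", len(split_lengths(10 * MB, maps)), "splits")

    # The case from the question: 500 MB with 200 requested mappers.
    print(len(split_lengths(500 * MB, 200)), "splits")

    # A low hint is overridden once the goal size exceeds the block size.
    print(len(split_lengths(500 * MB, 2)), "splits")
```

Under these assumptions, the 500 MB / 200 mappers case from the question gives 200 splits of 2.5 MB each, which matches the pattern in the three cases above. Asking for only 2 mappers on the same file does not give 2 splits, because the split size is capped at the block size, so the file is cut into more pieces than requested.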

Arijit Banerjee answered Sep 25 '22 05:09
