Does Hadoop split the data based on the number of mappers set in the program? That is, for a data set of 500 MB, if the number of mappers is 200 (assuming the Hadoop cluster allows 200 mappers to run simultaneously), is each mapper given 2.5 MB of data?
Also, do all the mappers run simultaneously, or might some of them run serially?
When Hadoop submits a job, it splits the input data logically into input splits, and each split is processed by one Mapper. The number of Mappers is equal to the number of input splits created. InputFormat.getSplits() is responsible for generating the input splits, and each split becomes the input of one map task.
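For reference, here is a minimal sketch (using the newer org.apache.hadoop.mapreduce API; the input path is a hypothetical placeholder) of how the split sizes that getSplits() produces can be bounded from the driver:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitSizeConfig {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-size-demo");
        job.setInputFormatClass(TextInputFormat.class);

        // Hypothetical input path; replace with your own.
        FileInputFormat.addInputPath(job, new Path("/data/input"));

        // Bound the split size: getSplits() will not produce splits larger
        // than the max split size or smaller than the min split size.
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);  // 128 MB

        // One map task is launched per split that getSplits() returns.
    }
}
```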
What is InputSplit in Hadoop? InputSplit in Hadoop MapReduce is the logical representation of the data. It describes the unit of work processed by a single map task in a MapReduce program; each InputSplit is the portion of the data handled by an individual Mapper.
Because HDFS blocks are cut at fixed byte offsets, a record can span two blocks. To handle this, Hadoop uses a logical representation of the data stored in file blocks, known as input splits. When a MapReduce job is submitted by the client, the framework calculates the total number of input splits and works out where the first record in a block starts and where the last record in the block finishes.
How is the splitting of the file invoked in Apache Hadoop? The input file for processing is stored on HDFS. The InputFormat component of the MapReduce job divides this file into splits, called InputSplits in Hadoop MapReduce.
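If you want to see those splits yourself, here is a small sketch (again with a hypothetical input path, new mapreduce API) that asks TextInputFormat for the splits it would hand to mappers:

```java
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class ListSplits {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "list-splits");
        // Hypothetical input path passed on the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // The logical splits getSplits() would produce; one map task per split.
        List<InputSplit> splits = new TextInputFormat().getSplits(job);
        System.out.println("Splits (= map tasks): " + splits.size());
        for (InputSplit split : splits) {
            System.out.println(split + "  length=" + split.getLength());
        }
    }
}
```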
I just ran a sample MR program based on your question, and here are my findings.
Input: a file smaller than the block size.
Case 1: Number of mappers = 1. Result: 1 map task launched. The input split size for the single mapper is the same as the input file size.
Case 2: Number of mappers = 5. Result: 5 map tasks launched. The input split size for each mapper is one fifth of the input file size.
Case 3: Number of mappers = 10. Result: 10 map tasks launched. The input split size for each mapper is one tenth of the input file size.
So, based on the above, for a file smaller than the block size:
split size = total input file size / number of map tasks launched
Note: keep in mind that the number of map tasks is ultimately decided based on the input splits (see the sketch below).
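For what it's worth, my test used the old org.apache.hadoop.mapred API, where the requested number of map tasks is passed to FileInputFormat.getSplits() as a hint (goalSize = totalSize / numSplits). A rough sketch of the driver, with hypothetical input/output paths:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MapperCountExperiment {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MapperCountExperiment.class);
        conf.setJobName("mapper-count-experiment");

        // Hypothetical paths used for the experiment.
        FileInputFormat.setInputPaths(conf, new Path("/data/small-file"));
        FileOutputFormat.setOutputPath(conf, new Path("/data/out"));

        // Hint: ask for 5 map tasks. For a file smaller than one block, the
        // old-API split computation ends up with roughly totalSize / 5 per
        // split, so 5 splits and therefore 5 map tasks are launched.
        conf.setNumMapTasks(5);

        JobClient.runJob(conf);
    }
}
```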