 

Behavior of the parameter "mapred.min.split.size" in HDFS

Tags:

hadoop

hdfs

Does the parameter "mapred.min.split.size" change the size of the blocks in which the file was written earlier? Suppose that, when starting my job, I pass the parameter "mapred.min.split.size" with a value of 134217728 (128 MB). Which of the following is correct about what happens?

1 - Each map processes the equivalent of 2 HDFS blocks (assuming each block is 64 MB);

2 - My input file (previously loaded into HDFS) will be split again so that it occupies 128 MB blocks in HDFS;
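For concreteness, here is a minimal sketch of how the parameter might be passed when submitting the job (classic mapred API; the class name and paths are illustrative):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MyJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MyJob.class);
        conf.setJobName("split-size-test");

        // Request a minimum split size of 128 MB (134217728 bytes)
        conf.set("mapred.min.split.size", "134217728");

        FileInputFormat.setInputPaths(conf, new Path("/user/alexandre/input"));   // illustrative path
        FileOutputFormat.setOutputPath(conf, new Path("/user/alexandre/output")); // illustrative path

        JobClient.runJob(conf);
    }
}
```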

Alexandre asked Oct 04 '13


People also ask

What is mapred.min.split.size?

If mapred.min.split.size is set to 128 MB, the size of the InputSplit will be 128 MB even though the block size is 64 MB. It is not recommended to make the split size greater than the block size.

What is the default split size in Hadoop?

By default, the split size is approximately equal to the HDFS block size (128 MB). In a MapReduce program, the InputSplit is user-defined, so the user can control the split size based on the size of the data.

How is input split size calculated in Hadoop?

So, while storing 1 GB of data in HDFS, Hadoop will split this data into smaller chunks. Say the system has a default split size of 128 MB. Then Hadoop will store the 1 GB of data in 8 blocks (1024 / 128 = 8). So, to process these 8 blocks, i.e. 1 GB of data, 8 mappers are required.
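A quick check of that arithmetic in plain Java (the numbers are the ones from the snippet above):

```java
public class BlockCountCheck {
    public static void main(String[] args) {
        long fileSize  = 1024L * 1024 * 1024; // 1 GB in bytes
        long splitSize = 128L * 1024 * 1024;  // 128 MB split size in bytes

        // Ceiling division: a partially filled last split still needs a mapper
        long numSplits = (fileSize + splitSize - 1) / splitSize;

        System.out.println("Splits/mappers required: " + numSplits); // prints 8
    }
}
```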


2 Answers

The split size is calculated by the formula:

max(mapred.min.split.size, min(mapred.max.split.size, dfs.block.size))

In your case it will be:

split size = max(128 MB, min(Long.MAX_VALUE (default), 64 MB)) = 128 MB
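This mirrors the computeSplitSize() helper in Hadoop's new-API FileInputFormat. A minimal sketch plugging in the question's numbers (64 MB blocks, 128 MB minimum, default maximum):

```java
public class SplitSizeDemo {
    // Same logic as FileInputFormat.computeSplitSize():
    // max(minSize, min(maxSize, blockSize))
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;  // dfs.block.size = 64 MB
        long minSize   = 128L * 1024 * 1024; // mapred.min.split.size = 128 MB
        long maxSize   = Long.MAX_VALUE;     // mapred.max.split.size default

        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        System.out.println(splitSize / (1024 * 1024) + " MB"); // prints 128 MB
    }
}
```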

So, for the two statements in the question:

  1. Each map will process 2 HDFS blocks (assuming each block is 64 MB): True

  2. There will be a new division of my input file (previously loaded into HDFS) to occupy 128 MB blocks in HDFS: False

Making the minimum split size greater than the block size increases the split size, but at the cost of data locality.
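If you want to experiment with this trade-off, the newer mapreduce API exposes a setter for the same knob. A sketch, with the rest of the job setup assumed:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class MinSplitExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "min-split-example");

        // Equivalent to setting mapred.min.split.size (the property is named
        // mapreduce.input.fileinputformat.split.minsize in newer releases)
        FileInputFormat.setMinInputSplitSize(job, 134217728L); // 128 MB

        // ... set mapper/reducer, input/output paths, then submit as usual
    }
}
```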

Ankit Singhal answered Sep 30 '22


Assume that the minimum split size is defined as 128 MB and the block size is defined as 64 MB.

NOTE: By default, HDFS replicates each block to 3 different datanodes. Also, each map task normally performs its operation on a single block.

Hence, with a 128 MB split size, 2 blocks will be treated as a single split and a single map task will be created for it, running on a single datanode. This happens at the cost of data locality. By "cost of data locality" I mean that the block residing on a datanode where the map task is not running has to be fetched from that datanode and processed on the datanode where the map task is running, resulting in lower performance.

However, if we consider a file of size 128 MB with the default block size of 64 MB and the default minimum split size of 64 MB, then, as normally happens, two map tasks will be created, one for each 64 MB block.
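The two cases side by side, as a small self-contained sketch (the split-size formula is the one from the answer above):

```java
public class MapTaskCount {
    // max(minSize, min(maxSize, blockSize)), as in FileInputFormat
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    static long mapTasks(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        long fileSize  = 128L * 1024 * 1024; // 128 MB file
        long blockSize = 64L * 1024 * 1024;  // 64 MB blocks

        // Case 1: min split size raised to 128 MB -> one map task spanning 2 blocks
        long split1 = computeSplitSize(blockSize, 128L * 1024 * 1024, Long.MAX_VALUE);
        System.out.println("min=128MB -> " + mapTasks(fileSize, split1) + " map task(s)"); // 1

        // Case 2: defaults (min split = block size) -> two map tasks, one per block
        long split2 = computeSplitSize(blockSize, blockSize, Long.MAX_VALUE);
        System.out.println("defaults -> " + mapTasks(fileSize, split2) + " map task(s)"); // 2
    }
}
```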

Baban Gaigole answered Sep 30 '22