What is the default size that each Hadoop mapper will read?

Is it the HDFS block size of 64 MB? Is there a configuration parameter I can use to change it?

For mappers reading gzip files, is it true that the number of mappers must equal the number of gzip files?

asked Dec 26 '22 by Shawn

1 Answer

This is dependent on your:

  • Input format - some input formats (NLineInputFormat, WholeFileInputFormat) work on boundaries other than the block size. In general, though, anything that extends FileInputFormat will use the block boundaries as guides
  • File block size - individual files don't need to have the same block size as the default block size. This is set when the file is uploaded into HDFS - if not explicitly set, then the default block size (at the time of upload) is applied. Any change to the default / system block size after the file has been uploaded has no effect on the already uploaded file.
  • The two FileInputFormat configuration properties mapred.min.split.size and mapred.max.split.size usually default to 1 and Long.MAX_VALUE, but if these are overridden in your system configuration, or in your job, then this will change the amount of data processed by each mapper and the number of mapper tasks spawned (see the sketch after this list)
  • Non-splittable compression - a gzip file, for example, cannot be processed by more than a single mapper, so you'll get 1 mapper per gzip file (unless you're using something like CombineFileInputFormat or CompositeInputFormat)
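
To make the interaction concrete, here's a simplified sketch (not the actual Hadoop source) of how a FileInputFormat-style split size is derived for a splittable file:

    // Simplified sketch of the FileInputFormat split-size calculation
    long blockSize = 64L * 1024 * 1024;   // the file's HDFS block size, fixed at upload time
    long minSize   = 1L;                  // mapred.min.split.size (default 1)
    long maxSize   = Long.MAX_VALUE;      // mapred.max.split.size (default Long.MAX_VALUE)

    // With the defaults this collapses to the block size, i.e. one mapper per block
    long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));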

So if you have a file with a block size of 64 MB, but want to process more or less than this per map task, then you should just be able to set the following job configuration properties (a short driver sketch follows the list):

  • mapred.min.split.size - larger than the default, if you want to use fewer mappers, at the expense of (potentially) losing data locality (all data processed by a single map task may now be on 2 or more data nodes)
  • mapred.max.split.size - smaller than the default, if you want to use more mappers (say you have a CPU-intensive mapper) to process each file
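
For example, a minimal driver sketch using the old property names (assuming the org.apache.hadoop.mapreduce API; the byte values are only illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SplitSizeDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Fewer mappers: force each split to be at least 128 MB
            // (a single split may now cover blocks held on different data nodes)
            conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);

            // More mappers: cap each split at 32 MB (e.g. for CPU-intensive map tasks)
            // conf.setLong("mapred.max.split.size", 32L * 1024 * 1024);

            Job job = Job.getInstance(conf, "split-size-demo");
            // ... set mapper class, input/output paths, then job.waitForCompletion(true)
        }
    }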

If you're using MR2 / YARN then the above properties are deprecated and replaced by the following (see the sketch below):

  • mapreduce.input.fileinputformat.split.minsize
  • mapreduce.input.fileinputformat.split.maxsize
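
Inside a driver like the sketch above, the MR2-style override might look like this (FileInputFormat here is org.apache.hadoop.mapreduce.lib.input.FileInputFormat; the values are again illustrative):

    // New-style property names
    job.getConfiguration().setLong(
            "mapreduce.input.fileinputformat.split.minsize", 128L * 1024 * 1024);
    job.getConfiguration().setLong(
            "mapreduce.input.fileinputformat.split.maxsize", 256L * 1024 * 1024);

    // Equivalent helper methods on the new-API FileInputFormat
    FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
    FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
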
answered Jan 05 '23 by Chris White