 

Single or multiple files per mapper in Hadoop?

Does a mapper process multiple files at the same time, or can a mapper process only a single file at a time? I want to know the default behaviour.

user3396729 asked Mar 09 '23 18:03


2 Answers

  • By default, a typical MapReduce job runs one mapper per input split.
  • If a file is larger than the split size (i.e., it spans more than one input split), several mappers process that one file.
  • A file gets exactly one mapper if it is not splittable, such as a Gzip file, or if the job is DistCp, where a file is the finest level of granularity.
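
As a rough illustration of the three points above, here is a minimal sketch in plain Java (no Hadoop dependency) of how a FileInputFormat-style splitter could decide how many splits, and therefore mappers, one file gets. The class name `SplitCountSketch`, the method `splitLengths` and the 128 MB split size are assumptions made for this example. The 1.1 slop factor mirrors the constant used in Hadoop's `FileInputFormat`, but the real implementation also deals with block locations and zero-length files, which are left out here.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of how many splits (one mapper each) a single file produces. */
class SplitCountSketch {
    // The last split may grow up to 1.1x the split size, so a small
    // leftover is folded into it rather than becoming a tiny extra split.
    static final double SPLIT_SLOP = 1.1;

    /** Returns the length of each split for one file; zero-length files are not modelled. */
    static List<Long> splitLengths(long fileLength, long splitSize, boolean splittable) {
        List<Long> lengths = new ArrayList<>();
        if (!splittable) {
            // e.g. a Gzip file: the whole file goes to a single mapper
            lengths.add(fileLength);
            return lengths;
        }
        long remaining = fileLength;
        // Cut full-size splits while clearly more than one split's worth is left.
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            lengths.add(splitSize);
            remaining -= splitSize;
        }
        // Whatever is left becomes the final split.
        if (remaining > 0) {
            lengths.add(remaining);
        }
        return lengths;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = 128 * mb; // assumed split size for this example
        System.out.println(splitLengths(300 * mb, splitSize, true).size());  // splittable 300 MB file
        System.out.println(splitLengths(300 * mb, splitSize, false).size()); // same file, not splittable
        System.out.println(splitLengths(10 * mb, splitSize, true).size());   // small file
    }
}
```

Under these assumptions the splittable 300 MB file yields three splits (128 MB, 128 MB, 44 MB), while the non-splittable file and the small file each yield one.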
franklinsijo answered Mar 12 '23 07:03


If you go to the definition of FileInputFormat, you will see that it has three methods at the top:

addInputPath(JobConf conf, Path path) - Add a Path to the list of inputs for the map-reduce job. If the path is a directory, the job picks up all files in that directory, not just a single file as you describe.

addInputPathRecursively(List<FileStatus> result, FileSystem fs, Path path, PathFilter inputFilter) - Add files in the input path recursively into the results.

addInputPaths(JobConf conf, String commaSeparatedPaths) - Add the given comma-separated paths to the list of inputs for the map-reduce job.

Using these methods you can easily set up any combination of multiple inputs. Your InputFormat then divides this data into InputSplits, which are distributed among the map tasks. The Map-Reduce framework relies on the InputFormat of the job to:

  • Validate the input-specification of the job.

  • Split-up the input file(s) into logical InputSplits, each of which is then assigned to an individual Mapper.

  • Provide the RecordReader implementation to be used to glean input records from the logical InputSplit for processing by the Mapper.

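So technically a single mapper processes only its own InputSplit. Whether that split can contain data from several files depends on the format: with the default FileInputFormat-based formats a split belongs to a single file, while a format such as CombineFileInputFormat can pack several files into one split. For each particular format you should look at its InputSplit to understand how the data will be distributed across the mappers.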
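
To show how the choice of format changes what one mapper sees, here is a toy comparison in plain Java (no Hadoop dependency). Everything in it is hypothetical: the class `SplitAssignmentSketch`, the file names, and the sizes. The greedy packing is only a simplified stand-in for what a combining format such as CombineFileInputFormat does. The real one also takes node and rack locality into account.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy comparison: one split per file vs. packing small files into shared splits. */
class SplitAssignmentSketch {

    /** Per-file strategy (files assumed smaller than the split size): each file is its own split. */
    static List<List<String>> perFileSplits(Map<String, Long> files) {
        List<List<String>> splits = new ArrayList<>();
        for (String name : files.keySet()) {
            splits.add(Collections.singletonList(name));
        }
        return splits;
    }

    /** Combining strategy: greedily add files to a split until maxSplitSize would be exceeded. */
    static List<List<String>> combinedSplits(Map<String, Long> files, long maxSplitSize) {
        List<List<String>> splits = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentSize = 0;
        for (Map.Entry<String, Long> file : files.entrySet()) {
            // Close the current split if this file would push it over the limit.
            if (!current.isEmpty() && currentSize + file.getValue() > maxSplitSize) {
                splits.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(file.getKey());
            currentSize += file.getValue();
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    /** Hypothetical sample input: three small files with sizes in MB. */
    static Map<String, Long> sampleFiles() {
        Map<String, Long> files = new LinkedHashMap<>();
        files.put("a.txt", 10L);
        files.put("b.txt", 20L);
        files.put("c.txt", 120L);
        return files;
    }

    public static void main(String[] args) {
        System.out.println(perFileSplits(sampleFiles()));
        System.out.println(combinedSplits(sampleFiles(), 128L));
    }
}
```

With the per-file strategy each of the three files goes to its own mapper. With the combining strategy and an assumed 128 MB limit, a.txt and b.txt share one mapper and c.txt gets another.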

Alex answered Mar 12 '23 06:03