Single or multiple files per mapper in hadoop?

Question

Does a mapper process multiple files at the same time, or can it only process a single file at a time? I want to know the default behaviour.

franklinsijo · Accepted Answer

Typical MapReduce jobs use one mapper per input split by default.
If a file is larger than the split size (i.e., it has more than one input split), then multiple mappers process that file.
A file gets exactly one mapper if it is not splittable, like a gzip file, or if the job is DistCp, where a file is the finest level of granularity.
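To make the split arithmetic concrete, here is a minimal sketch in plain Java that models how many input splits (and therefore mappers) a single file would get. It is not Hadoop code: the class and method names are made up for illustration, and it assumes the usual FileInputFormat rules, namely a split size of max(minSize, min(maxSize, blockSize)) and a last split that may be up to 10% larger than the split size. Empty files are deliberately left out of the model.

```java
// Illustrative model only; SplitEstimator is a hypothetical name, not a Hadoop class.
final class SplitEstimator {
    // Assumed slop factor: the final split may exceed the split size by up to 10%.
    static final double SPLIT_SLOP = 1.1;

    // Split size as assumed here: the block size clamped between minSize and maxSize.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Number of splits (one mapper each) for a single non-empty file.
    static int countSplits(long fileLength, long splitSize, boolean splittable) {
        if (fileLength <= 0 || splitSize <= 0) {
            throw new IllegalArgumentException("this sketch only covers non-empty files");
        }
        if (!splittable) {
            return 1; // e.g. a gzip file: the whole file goes to one mapper
        }
        long remaining = fileLength;
        int splits = 0;
        // Cut full-size splits while the remainder is more than 1.1x the split size.
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            remaining -= splitSize;
            splits++;
        }
        if (remaining != 0) {
            splits++; // the tail becomes the last split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = computeSplitSize(128 * mb, 1, Long.MAX_VALUE);
        System.out.println(countSplits(300 * mb, splitSize, true));  // 3
        System.out.println(countSplits(130 * mb, splitSize, true));  // 1
        System.out.println(countSplits(300 * mb, splitSize, false)); // 1
    }
}
```

Under these assumptions a 300 MB splittable file with a 128 MB split size is handled by three mappers, while the same file compressed with gzip is handled by one.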

Alex · Answer

If you go to the definition of FileInputFormat, you will see that near the top it has three methods:

addInputPath(JobConf conf, Path path) - Add a Path to the list of inputs for the map-reduce job. If the path is a directory, the job picks up all the files in it, not just a single file.

addInputPathRecursively(List<FileStatus> result, FileSystem fs, Path path, PathFilter inputFilter) - Add files in the input path recursively into the results.

addInputPaths(JobConf conf, String commaSeparatedPaths) - Add the given comma separated paths to the list of inputs for the map-reduce job

Using these methods you can easily set up any combination of inputs you want. The InputFormat then divides that data into InputSplits, which are distributed among the map tasks. The MapReduce framework relies on the InputFormat of the job to:

Validate the input-specification of the job.
Split-up the input file(s) into logical InputSplits, each of which is then assigned to an individual Mapper.
Provide the RecordReader implementation to be used to glean input records from the logical InputSplit for processing by the Mapper.

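So, technically, a single mapper processes only its own split. With the default FileInputFormat a split never spans more than one file, whereas a format such as CombineFileInputFormat can pack data from several files into one split. For each particular format you should look at how it builds its InputSplits to understand how data will be distributed across the mappers.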
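To show how one split can end up holding data from several files, here is a rough plain-Java model of the kind of packing a combining input format such as CombineFileInputFormat performs. It is only a sketch: the class name, the greedy strategy, and the file names are assumptions made for illustration, and it ignores things the real format considers, such as node and rack locality and breaking large files into chunks.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model only; CombinePackingSketch is a hypothetical name, not a Hadoop class.
final class CombinePackingSketch {
    // Greedily groups whole files so that each group's total size stays within maxSplitSize.
    // A file larger than maxSplitSize still gets a group of its own in this sketch.
    static List<List<String>> pack(String[] names, long[] sizes, long maxSplitSize) {
        List<List<String>> splits = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentSize = 0;
        for (int i = 0; i < names.length; i++) {
            // Close the current group if adding this file would exceed the limit.
            if (!current.isEmpty() && currentSize + sizes[i] > maxSplitSize) {
                splits.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(names[i]);
            currentSize += sizes[i];
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        String[] names = {"a.txt", "b.txt", "c.txt", "d.txt"}; // hypothetical input files
        long[] sizesMb = {40, 50, 30, 100};
        // Each inner list stands for one split, i.e. the input of one mapper.
        System.out.println(pack(names, sizesMb, 128));
    }
}
```

In this model the first three small files share one split and the fourth file gets its own, so two mappers would process four files.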

Tags:

hadoop

hadoop2

mapreduce

hadoop-yarn

user3396729
