Does a mapper process multiple files at the same time, or can a mapper only process a single file at a time? I want to know the default behaviour.
A mapper gets a whole file to itself when the file is not splittable, such as a Gzip file, or when the process is DistCp, where a file is the finest level of granularity.
If you go to the definition of FileInputFormat, you will see that near the top it has three methods:
addInputPath(JobConf conf, Path path) - Add a Path to the list of inputs for the map-reduce job. If the path is a directory, the job picks up all files in that directory, not just a single file as you suggest.
addInputPathRecursively(List<FileStatus> result, FileSystem fs, Path path, PathFilter inputFilter) - Add files in the input path recursively into the results.
addInputPaths(JobConf conf, String commaSeparatedPaths) - Add the given comma separated paths to the list of inputs for the map-reduce job.
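As a rough illustration of the two public methods above, a driver might register its inputs as in the following sketch. It assumes the old org.apache.hadoop.mapred API and a Hadoop client library on the classpath; the class name InputPathsExample and all paths are hypothetical and only serve the example.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class InputPathsExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf(InputPathsExample.class);

        // Hypothetical directory: the files inside it become job input.
        FileInputFormat.addInputPath(conf, new Path("/data/logs/2020-01"));

        // Hypothetical paths: several inputs registered at once, comma separated.
        FileInputFormat.addInputPaths(conf, "/data/logs/2020-02,/data/extra/part-00000");

        // Print every input path that is now registered on the job.
        for (Path p : FileInputFormat.getInputPaths(conf)) {
            System.out.println(p);
        }
    }
}
```

addInputPathRecursively is a protected helper that the format uses internally while listing input files, so a driver would normally not call it directly.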
Using these three methods you can easily set up whatever combination of inputs you want. The InputFormat of your job then divides this data into InputSplits, which are handed out to the map tasks. The Map-Reduce framework relies on the InputFormat of the job to:
Validate the input-specification of the job.
Split-up the input file(s) into logical InputSplits, each of which is then assigned to an individual Mapper.
Provide the RecordReader implementation to be used to glean input records from the logical InputSplit for processing by the Mapper.
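To make the splitting step more concrete, here is a small Hadoop-free sketch of how a file-based format could cut each file into logical splits. It is a simplified model and not Hadoop's actual code: the class SplitSketch, its method names, and the rule that only ".gz" files are non-splittable are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of per-file split computation; not the real Hadoop implementation. */
final class SplitSketch {

    /** One logical split: a byte range inside a single file. */
    static final class Split {
        final String file;
        final long start;
        final long length;

        Split(String file, long start, long length) {
            this.file = file;
            this.start = start;
            this.length = length;
        }

        @Override
        public String toString() {
            return file + "[" + start + "+" + length + "]";
        }
    }

    /** Assumption for this sketch: only ".gz" files cannot be split. */
    static boolean isSplittable(String file) {
        return !file.endsWith(".gz");
    }

    /** Cuts one file into splits of at most splitSize bytes each. */
    static List<Split> computeSplits(String file, long fileLength, long splitSize) {
        List<Split> splits = new ArrayList<>();
        if (!isSplittable(file)) {
            // A non-splittable file goes to exactly one mapper as a whole.
            splits.add(new Split(file, 0, fileLength));
            return splits;
        }
        for (long offset = 0; offset < fileLength; offset += splitSize) {
            long length = Math.min(splitSize, fileLength - offset);
            splits.add(new Split(file, offset, length));
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println(computeSplits("big.txt", 300 * mb, 128 * mb));
        System.out.println(computeSplits("big.txt.gz", 300 * mb, 128 * mb));
    }
}
```

In this model a 300 MB text file with a 128 MB split size is handled by three mappers, while the same amount of data in a Gzip file goes to a single mapper.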
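So technically, a single mapper processes only its own split, which can contain data from several files. For each particular format, though, you should look at how its InputSplits are built to understand how the data will be distributed across the mappers: the stock FileInputFormat subclasses create splits that each cover part or all of one file, while CombineFileInputFormat packs several files into one split.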
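The case where one mapper's split spans several files can be pictured with the following Hadoop-free sketch. It greedily packs whole files into splits up to a size limit, loosely in the spirit of a combining input format; the class CombineSketch, the method pack, and the file names and sizes are hypothetical, and Hadoop's real grouping logic (which also considers data locality) is more involved.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified model of packing several small files into one split. */
final class CombineSketch {

    /** Greedily groups whole files, in iteration order, into splits of at most maxSplitBytes. */
    static List<List<String>> pack(Map<String, Long> fileSizes, long maxSplitBytes) {
        List<List<String>> splits = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentBytes = 0;
        for (Map.Entry<String, Long> entry : fileSizes.entrySet()) {
            long size = entry.getValue();
            // Close the current split if this file would push it over the limit.
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(entry.getKey());
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        Map<String, Long> files = new LinkedHashMap<>();
        files.put("a.txt", 40L);
        files.put("b.txt", 50L);
        files.put("c.txt", 30L);
        files.put("d.txt", 90L);
        // Each inner list is the set of files one mapper would read.
        System.out.println(pack(files, 100L));
    }
}
```

Each inner list stands for the input of one mapper, so with a limit of 100 bytes the first mapper in this toy run reads both a.txt and b.txt.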