 

How does MapReduce read from multiple input files?

I am developing code that reads data and writes it into HDFS using MapReduce. However, when I have multiple files, I don't understand how they are processed. The input path passed to the mapper is the name of the directory, as is evident from the output of

String filename = conf1.get("map.input.file");

So how does it process the files in the directory?

RadAl asked Apr 27 '26 22:04

1 Answer

In order to get the input file path you can use the context object, like this:

FileSplit fileSplit = (FileSplit) context.getInputSplit();
String inputFilePath = fileSplit.getPath().toString();
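For context, here is a minimal mapper sketch (class and output key choice are illustrative assumptions, using the `org.apache.hadoop.mapreduce` API with a text input format) showing where that snippet fits — it emits each input line keyed by the file it came from:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Sketch: key each line of input by the path of the file it was read from.
public class FileNameMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        FileSplit fileSplit = (FileSplit) context.getInputSplit();
        String inputFilePath = fileSplit.getPath().toString();
        context.write(new Text(inputFilePath), value);
    }
}
```

Note the cast assumes the input format actually produces `FileSplit` instances (true for the standard file-based input formats); with a custom input format the cast may fail.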

As for how multiple files are processed:

Several instances of the mapper are created on different machines in the cluster, and each instance receives a different input split. If a file is bigger than the default DFS block size (128 MB), it is further split into smaller parts, which are then distributed to the mappers.

So you can configure the input size received by each mapper in two ways:

  • change the HDFS block size (e.g. dfs.block.size=1048576)
  • set the parameter mapred.min.split.size (this can only be set to a value larger than the HDFS block size)
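The interaction of these two settings can be sketched as plain arithmetic — this mirrors the split-size rule Hadoop's file input formats use, splitSize = max(minSize, min(maxSize, blockSize)):

```java
public class SplitSizeDemo {
    // Mirrors the split-size rule used by Hadoop's file input formats:
    // splitSize = max(minSize, min(maxSize, blockSize))
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // 128 MB default block size

        // Defaults: min split 1 byte, max split unbounded -> one split per block
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE));

        // Raising the minimum split size above the block size -> larger splits,
        // i.e. each mapper receives more data
        long minSize = 256L * 1024 * 1024; // 256 MB
        System.out.println(computeSplitSize(blockSize, minSize, Long.MAX_VALUE));
    }
}
```

With the defaults each mapper gets one 128 MB block; raising the minimum split size to 256 MB makes each split span two blocks.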

Note: These parameters only take effect if your input format supports splitting the input files. Common compression codecs (such as gzip) don't support splitting, so for such files these parameters are ignored.

Amar answered Apr 29 '26 19:04


