 

Hadoop, MapReduce - Multiple Input/Output Paths

When building the jar for my MapReduce job, I am using the hadoop-local command. Instead of specifying the path to each file in my input folder individually, is there a way to pass all of the files from that folder to the MapReduce job? The contents and the number of files can change because of the nature of the job I am configuring, so I do not know in advance how many files there will be. Is there a way to pass every file from the input folder into my MapReduce program, iterate over each file to compute a certain function, and then send the results to the Reducer? I am using a single Map/Reduce program, coded in Java. I am able to use the hadoop-moonshot command, but I am working with hadoop-local at the moment.

Thanks.

Shah.1 asked May 14 '16 17:05

People also ask

Can you provide multiple input paths to MapReduce jobs?

We use the MultipleInputs class, which supports MapReduce jobs that have multiple input paths with a different InputFormat and Mapper for each path.

Is it possible to provide multiple inputs to Hadoop?

If multiple input files (e.g. data1, data2) are present in the same directory such as /folder1, you can pass the directory itself as the input path. By default Hadoop does not read the directory recursively; to descend into subdirectories, set the mapreduce.input.fileinputformat.input.dir.recursive property to true.

Can you assign different mappers to different input paths?

Each mapper class works on a different set of inputs, but they all emit key-value pairs consumed by the same reducer.


1 Answer

You don't have to pass each individual file as input to your MapReduce job.

The FileInputFormat class already provides an API to accept a list of multiple files as input to a MapReduce program.

public static void setInputPaths(Job job,
                 Path... inputPaths)
                          throws IOException

Sets the given paths as the full list of inputs for the map-reduce job. Parameters:

job - The job to modify

inputPaths - the Paths of the input directories/files for the map-reduce job

Example code from the Apache WordCount tutorial:

Job job = Job.getInstance(conf, "word count");
FileInputFormat.addInputPath(job, new Path(args[0]));
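In practice you can pass the input folder itself to addInputPath: FileInputFormat expands a directory into the files it contains (non-recursively by default, as noted above), so you never need to know how many files are in it. If you do want to enumerate and filter the files yourself, you can build the comma-separated path string that FileInputFormat.setInputPaths(Job, String) also accepts. A minimal sketch with plain Java (the folder name is whatever your job uses; the Hadoop call is shown only in a comment):

```java
import java.io.File;
import java.util.StringJoiner;

// Builds the comma-separated path list that
// FileInputFormat.setInputPaths(Job, String) accepts.
public class InputPathBuilder {

    // Join the paths of the given files with commas.
    static String buildInputPaths(File[] files) {
        StringJoiner joiner = new StringJoiner(",");
        for (File f : files) {
            joiner.add(f.getPath());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // List every regular file in the input folder,
        // skipping subdirectories.
        File inputDir = new File(args.length > 0 ? args[0] : ".");
        File[] files = inputDir.listFiles(File::isFile);
        String inputPaths =
                buildInputPaths(files == null ? new File[0] : files);
        System.out.println(inputPaths);
        // In the Hadoop driver you would then call:
        // FileInputFormat.setInputPaths(job, inputPaths);
    }
}
```

Note that this manual enumeration is rarely needed; passing the directory path as args[0] in the tutorial code above already covers the common case.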

MultipleInputs provides the API below.

public static void addInputPath(Job job,
                Path path,
                Class<? extends InputFormat> inputFormatClass,
                Class<? extends Mapper> mapperClass)

Add a Path with a custom InputFormat and Mapper to the list of inputs for the map-reduce job.
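For example, two input folders with different formats could each get their own mapper while feeding the same reducer. A hedged sketch of the driver wiring (the paths and the mapper/reducer class names are hypothetical; this fragment assumes the Hadoop client libraries are on the classpath):

```java
// Each path gets its own InputFormat and Mapper; both mappers must emit
// the same key/value types so one reducer can consume their output.
MultipleInputs.addInputPath(job, new Path("/data/text"),
        TextInputFormat.class, TextLineMapper.class);
MultipleInputs.addInputPath(job, new Path("/data/seq"),
        SequenceFileInputFormat.class, SequenceRecordMapper.class);
job.setReducerClass(CombiningReducer.class);
```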

Related SE question:

Can hadoop take input from multiple directories and files

Refer to the MultipleOutputs API for your second query, on multiple output paths.

FileOutputFormat.setOutputPath(job, outDir);

// Defines additional text-based output 'text' for the job
MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class,
        LongWritable.class, Text.class);

// Defines additional sequence-file based output 'seq' for the job
MultipleOutputs.addNamedOutput(job, "seq",
        SequenceFileOutputFormat.class,
        LongWritable.class, Text.class);
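Once the named outputs are declared in the driver, a reducer writes to them through a MultipleOutputs instance. A sketch of that side (the routing condition is hypothetical; this assumes the Hadoop client libraries are available):

```java
// Hypothetical reducer that routes each record to one of the named
// outputs ("text" or "seq") declared in the driver above.
public class RoutingReducer
        extends Reducer<LongWritable, Text, LongWritable, Text> {

    private MultipleOutputs<LongWritable, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(LongWritable key, Iterable<Text> values,
                          Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // Route by an application-specific condition (hypothetical).
            if (value.toString().startsWith("#")) {
                mos.write("text", key, value);
            } else {
                mos.write("seq", key, value);
            }
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        mos.close(); // required, or buffered named-output data is lost
    }
}
```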

Have a look at related SE questions regarding multiple output files.

Writing to multiple folders in hadoop?

hadoop method to send output to multiple directories

Ravindra babu answered Oct 13 '22 07:10