When building the jar for my MapReduce job, I am using the hadoop-local command. Instead of specifying the path of every file in my input folder individually, is there a way to pass all the files in the folder at once? The contents and number of files can change due to the nature of the job I am configuring, so I do not know in advance how many files there will be, only what their contents look like. Is there a way to pass all files from the input folder into my MapReduce program and then iterate over each file to compute a certain function, sending the results to the reducer? I am only using one map/reduce program and I am coding in Java. I am able to use the hadoop-moonshot command, but I am working with hadoop-local at the moment.
Thanks.
You can use the MultipleInputs class, which supports MapReduce jobs that have multiple input paths, with a different InputFormat and Mapper for each path.
If multiple input files are present in the same directory: by default, Hadoop does not read a directory recursively. But if input files such as data1, data2, etc. are present under /folder1 (including its sub-directories), set mapreduce.input.fileinputformat.input.dir.recursive to true.
Each mapper class works on a different set of inputs, but they all emit key-value pairs that are consumed by the same reducer.
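As a minimal sketch of the recursive case, assuming the /folder1 layout described above (the driver class name and paths are illustrative; the property key is the standard Hadoop 2.x one):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class RecursiveInputDriver {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Tell FileInputFormat to descend into sub-directories of the input path.
        conf.setBoolean("mapreduce.input.fileinputformat.input.dir.recursive", true);
        Job job = Job.getInstance(conf, "read whole folder");
        // Passing the directory itself makes every file under it a map input.
        FileInputFormat.addInputPath(job, new Path("/folder1"));
    }
}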
You don't have to pass each individual file as input to a MapReduce job.
The FileInputFormat class already provides an API to accept a list of multiple files as input to a MapReduce program.
public static void setInputPaths(Job job,
Path... inputPaths)
throws IOException
Set the array of Paths as the list of inputs for the map-reduce job. Parameters:
job - the Job to modify
inputPaths - the Paths of the input directories or files for the map-reduce job.
Example code from the Apache tutorial:
Job job = Job.getInstance(conf, "word count");
FileInputFormat.addInputPath(job, new Path(args[0]));
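Note that args[0] can be a directory rather than a single file; FileInputFormat expands a directory to all the files inside it (names starting with _ or . are treated as hidden and skipped). Continuing the tutorial snippet with illustrative paths, the varargs setInputPaths shown above sets several inputs in one call:

// Replaces the job's input list; each entry may be a file or a whole directory.
FileInputFormat.setInputPaths(job,
    new Path("/user/hadoop/input"),          // a whole folder of files
    new Path("/user/hadoop/extra/one.txt")); // plus a single extra file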
MultipleInputs provides the API below:
public static void addInputPath(Job job,
Path path,
Class<? extends InputFormat> inputFormatClass,
Class<? extends Mapper> mapperClass)
Add a Path with a custom InputFormat and Mapper to the list of inputs for the map-reduce job.
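A usage sketch, assuming two hypothetical mapper classes FirstMapper and SecondMapper that emit the same key/value types, plus a hypothetical MyReducer:

// Each path gets its own mapper; all map outputs flow to the same reducer.
MultipleInputs.addInputPath(job, new Path("/data1"),
    TextInputFormat.class, FirstMapper.class);
MultipleInputs.addInputPath(job, new Path("/data2"),
    TextInputFormat.class, SecondMapper.class);
job.setReducerClass(MyReducer.class);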
Related SE question:
Can hadoop take input from multiple directories and files
Regarding multiple output paths, refer to the MultipleOutputs API:
FileOutputFormat.setOutputPath(job, outDir);

// Defines additional single text based output 'text' for the job
MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class,
    LongWritable.class, Text.class);

// Defines additional sequence-file based output 'seq' for the job
MultipleOutputs.addNamedOutput(job, "seq", SequenceFileOutputFormat.class,
    LongWritable.class, Text.class);
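To write to those named outputs, the reducer creates a MultipleOutputs instance in setup() and must close it in cleanup(), or the extra output files are never flushed. A sketch matching the LongWritable/Text types registered above (the reducer class and its summing logic are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class MyReducer extends Reducer<Text, LongWritable, LongWritable, Text> {
    private MultipleOutputs<LongWritable, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable v : values) {
            sum += v.get();
        }
        // Route this record to the 'text' named output instead of the default one.
        mos.write("text", new LongWritable(sum), key);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close(); // required so the named output files are flushed and committed
    }
}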
Have a look at these related SE questions regarding multiple output files:
Writing to multiple folders in hadoop?
hadoop method to send output to multiple directories