I have a situation where I have multiple files (100+, 2-3 MB each) in compressed gz format, spread across multiple directories. For example:
A1/B1/C1/part-0000.gz
A2/B2/C2/part-0000.gz
A1/B1/C1/part-0001.gz
I have to feed all these files into one Map job. From what I see, to use MultipleFileInputFormat all input files need to be in the same directory. Is it possible to pass multiple directories directly into the job?
If not, is it possible to efficiently move these files into one directory without naming conflicts, or to merge them into a single compressed gz file?
Note: I am using plain Java to implement the Mapper, not Pig or Hadoop Streaming.
Any help regarding the above issue will be deeply appreciated.
Thanks,
Ankit
Yes, it is possible to pass multiple input directories to a single Hadoop MapReduce job.
You don't have to pass each file individually. The FileInputFormat class already provides an API to build up a list of inputs: FileInputFormat.addInputPath(Job job, Path path) adds one path to the list of inputs for the map-reduce job, and you can call it once per directory.
If the input files are nested inside subdirectories: by default Hadoop does not read a directory recursively. If files like data1, data2, etc. sit in subdirectories under /folder1, set the property mapreduce.input.fileinputformat.input.dir.recursive to true so each input path is scanned recursively.
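A sketch of how that would look in a driver, assuming the standard Hadoop 2.x API (the job name and /folder1 path are placeholders for your own values); treat this as a configuration fragment rather than a complete driver:

```java
Configuration conf = new Configuration();
// Make FileInputFormat descend into subdirectories of each input path.
conf.setBoolean("mapreduce.input.fileinputformat.input.dir.recursive", true);
Job job = Job.getInstance(conf, "gz-input-job");  // job name is illustrative
// One call per top-level directory; nested part-*.gz files are now picked up.
FileInputFormat.addInputPath(job, new Path("/folder1"));
```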
Alternatively, FileInputFormat.addInputPaths() takes a Job plus a comma-separated list of paths, like
FileInputFormat.addInputPaths(job, "foo/file1.gz,bar/file2.gz")
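Since addInputPaths() splits its argument on commas, you can assemble that string from your known directories with plain Java before handing it to the driver. A minimal sketch (the class and method names here are hypothetical helpers, not part of Hadoop):

```java
import java.util.Arrays;
import java.util.List;

public class InputPathJoiner {
    // Builds the comma-separated string that
    // FileInputFormat.addInputPaths(job, ...) expects.
    // Individual paths must not themselves contain commas.
    public static String joinPaths(List<String> paths) {
        return String.join(",", paths);
    }

    public static void main(String[] args) {
        String joined = joinPaths(Arrays.asList(
                "A1/B1/C1/part-0000.gz",
                "A2/B2/C2/part-0000.gz",
                "A1/B1/C1/part-0001.gz"));
        System.out.println(joined);
        // In the driver you would then call:
        // FileInputFormat.addInputPaths(job, joined);
    }
}
```

Passing directories instead of individual files works too, since FileInputFormat expands each directory into the files it contains.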