
Hadoop job taking input files from multiple directories

Tags: file, input, hadoop


I have a situation where I have multiple files (100+, 2-3 MB each) in compressed gz format spread across multiple directories. For example:

A1/B1/C1/part-0000.gz
A2/B2/C2/part-0000.gz
A1/B1/C1/part-0001.gz

I have to feed all of these files into one map job. From what I see, to use MultipleFileInputFormat all input files need to be in the same directory. Is it possible to pass multiple directories directly into the job?
If not, is it possible to efficiently move these files into one directory without naming conflicts, or to merge them into a single compressed gz file?
Note: I am using plain Java to implement the mapper, not Pig or Hadoop streaming.

Any help regarding the above issue will be deeply appreciated.
Thanks,
Ankit

asked Jan 04 '11 by Ankit

People also ask

Can we have Hadoop job output in multiple directories?

Yes, it is possible to have the output of a Hadoop MapReduce job written to multiple directories.
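One mechanism for this (not covered elsewhere in this thread) is the MultipleOutputs class from org.apache.hadoop.mapreduce.lib.output. Below is a minimal reducer-side sketch; the named output "text", the key/value types, and the per-key subdirectory scheme are all illustrative assumptions, not something from the original post:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// In the driver, first register a named output (name and types illustrative):
//   MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class,
//                                  Text.class, IntWritable.class);
public class MultiDirReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private MultipleOutputs<Text, IntWritable> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        // A baseOutputPath containing "/" writes into a subdirectory,
        // e.g. <job-output>/<key>/part-r-00000.
        mos.write("text", key, new IntWritable(sum), key.toString() + "/part");
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}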

Can you provide multiple input path to MapReduce job?

You don't have to pass each individual file as input to a MapReduce job. The FileInputFormat class already provides an API that accepts multiple inputs: FileInputFormat.addInputPath(Job job, Path path) adds path to the list of inputs for the map-reduce job, and can be called once per file or directory.
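For example, a short driver fragment using that API (this assumes a configured Job named job; the paths mirror the layout in the question):

FileInputFormat.addInputPath(job, new Path("A1/B1/C1"));
FileInputFormat.addInputPath(job, new Path("A2/B2/C2"));
// Each call appends to the job's input path list; files and
// directories can be mixed freely.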

Is it possible to provide multiple inputs to Hadoop?

Yes. If multiple input files such as data1, data2, etc. are present in the same directory, say /folder1, you can pass the directory itself as the input path. Note that by default Hadoop does not read a directory recursively; to pick up files nested in subdirectories, set mapreduce.input.fileinputformat.input.dir.recursive to true.
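As a sketch, turning on recursive input listing in the driver might look like this (the property name is the Hadoop 2.x one, and the fragment assumes a Job named job):

// Hadoop 2.x property; older releases used mapred.input.dir.recursive.
job.getConfiguration().setBoolean(
        "mapreduce.input.fileinputformat.input.dir.recursive", true);
FileInputFormat.addInputPath(job, new Path("/folder1"));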


1 Answer

FileInputFormat.addInputPaths() can take a comma-separated list of multiple files, like

FileInputFormat.addInputPaths(job, "foo/file1.gz,bar/file2.gz");
answered Oct 12 '22 by bajafresh4life