How to recursively use a directory structure in the new Hadoop API?

My file structure is the following:

/indir/somedir1/somefile
/indir/somedir1/someotherfile...
/indir/somedir2/somefile
/indir/somedir2/someotherfile...

I now want to pass everything recursively into an MR job, and I am using the new API. So I did:

FileInputFormat.setInputPaths(job, new Path("/indir"));

But the job fails with:

Error: java.io.FileNotFoundException: Path is not a file: /indir/somedir1

I am using Hadoop 2.4, and in this post it is stated that Hadoop 2's new API does not support recursive input directories. But I am wondering how this can be, as I think it is the most ordinary thing in the world to throw a large nested directory structure at a Hadoop job...

So, is this intended behavior, or is this a bug? Either way, is there a workaround other than using the old API?

asked Oct 30 '14 by rabejens

2 Answers

I found the answer myself. In the JIRA issue linked from the mentioned forum post, there are two comments explaining how to do it right:

  1. Set mapreduce.input.fileinputformat.input.dir.recursive to true (the comment mentions mapred.input.dir.recursive, but that key is deprecated)
  2. Use FileInputFormat.addInputPath to specify the input directory

With these changes, it works.
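
Put together, a minimal driver sketch might look like this (the job name is illustrative, and the mapper/reducer/output setup is omitted; note that the property should be set before the Job is created, because Job copies the Configuration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

Configuration conf = new Configuration();
// Enable recursive traversal of input directories; the old
// mapred.input.dir.recursive key is deprecated in favor of this one.
conf.setBoolean("mapreduce.input.fileinputformat.input.dir.recursive", true);

Job job = Job.getInstance(conf, "recursive-input-example");
// Point at the top-level directory; files under /indir/somedir1,
// /indir/somedir2, etc. are now picked up recursively.
FileInputFormat.addInputPath(job, new Path("/indir"));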

answered by rabejens


Another way to configure it is via the FileInputFormat class:

FileInputFormat.setInputDirRecursive(job, true);
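
For completeness, a minimal sketch of where this call fits in a driver (the job name is illustrative; the helper sets the same mapreduce.input.fileinputformat.input.dir.recursive property under the hood):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

Job job = Job.getInstance(new Configuration(), "recursive-input");
// Ask FileInputFormat to descend into subdirectories of the input paths.
FileInputFormat.setInputDirRecursive(job, true);
FileInputFormat.addInputPath(job, new Path("/indir"));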

answered by shapiy