My file structure is as follows:
/indir/somedir1/somefile
/indir/somedir1/someotherfile...
/indir/somedir2/somefile
/indir/somedir2/someotherfile...
I now want to pass everything recursively into an MR job, and I am using the new API. So I did:
FileInputFormat.setInputPaths(job, new Path("/indir"));
But the job fails with:
Error: java.io.FileNotFoundException: Path is not a file: /indir/somedir1
I am using Hadoop 2.4, and this post states that Hadoop 2's new API does not support recursive file listing. But I wonder how that can be, since throwing a large nested directory structure at a Hadoop job seems like the most ordinary thing in the world...
So, is this intended behavior or a bug? Either way, is there a workaround other than falling back to the old API?
I found the answer myself. In the JIRA linked from the mentioned forum post, there are two comments describing how to do it right:
- Set mapreduce.input.fileinputformat.input.dir.recursive to true (the comment states mapred.input.dir.recursive, but that key is deprecated).
- Use FileInputFormat.addInputPath to specify the input directory.
With these changes, it works.
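Putting both steps together, here is a minimal driver sketch. The class name RecursiveInputDriver, the job name, and the /outdir output path are made up for illustration, and the mapper/reducer setup is omitted:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RecursiveInputDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Tell FileInputFormat to descend into subdirectories of the input path.
        conf.setBoolean("mapreduce.input.fileinputformat.input.dir.recursive", true);

        Job job = Job.getInstance(conf, "recursive input example");
        job.setJarByClass(RecursiveInputDriver.class);
        // Mapper, reducer, and output key/value types omitted for brevity.

        // Add the top-level directory; files under /indir/somedir1,
        // /indir/somedir2, ... are now collected as input.
        FileInputFormat.addInputPath(job, new Path("/indir"));
        FileOutputFormat.setOutputPath(job, new Path("/outdir"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}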
Another way to configure it is via the FileInputFormat class:
FileInputFormat.setInputDirRecursive(job, true);
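In the driver sketch above, this call replaces the conf.setBoolean(...) line. It has to come after Job.getInstance(...), because it writes the same mapreduce.input.fileinputformat.input.dir.recursive flag into the job's own configuration:

Job job = Job.getInstance(new Configuration(), "recursive input example");
FileInputFormat.setInputDirRecursive(job, true); // same effect as setting the config key
FileInputFormat.addInputPath(job, new Path("/indir"));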