
Hadoop MapReduce provide nested directories as job input


I'm working on a job that processes a nested directory structure, containing files on multiple levels:

one/
├── three/
│   └── four/
│       ├── baz.txt
│       ├── bleh.txt
│       └── foo.txt
└── two/
    ├── bar.txt
    └── gaa.txt

When I add one/ as an input path, no files are processed, since none are immediately available at the root level.

I read about job.addInputPathRecursively(..), but it seems to have been deprecated in more recent releases (I'm using Hadoop 1.0.2). I've written some code to walk the folders and add each directory with job.addInputPath(dir), which worked until the job crashed while trying to process a directory as an input file, e.g. calling fs.open(split.getPath()) when split.getPath() is a directory (this happens inside LineRecordReader.java).
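For reference, the walk logic can be sketched without any Hadoop dependencies. This is just the recursion shape (descend into subdirectories, collect only leaf files) written with java.nio.file standing in for Hadoop's FileSystem/FileStatus API; with the Hadoop classes you would use FileSystem.listStatus(path) and FileStatus.isDir() instead, and pass each collected file to job.addInputPath:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class RecursiveInputWalk {

    // Collect every regular file under root, descending into subdirectories.
    // Adding only these leaf files as inputs (rather than the directories
    // themselves) avoids handing a directory to the record reader.
    static List<Path> collectFiles(Path root) throws IOException {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(root)) {
            for (Path entry : entries) {
                if (Files.isDirectory(entry)) {
                    files.addAll(collectFiles(entry)); // recurse into subdir
                } else {
                    files.add(entry);                  // leaf file
                }
            }
        }
        return files;
    }
}
```

The crash described above happens when a directory itself ends up as an input split; collecting and adding only the leaf files sidesteps it, at the cost of one addInputPath call per file.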

I'm trying to convince myself there has to be a simpler way to provide a job with a nested directory structure. Any ideas?

EDIT - apparently there's an open bug on this.
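(For readers on later releases: the eventual resolution was a recursive flag on FileInputFormat. In Hadoop 2.x onward, so not available in 1.0.2, it can be enabled via FileInputFormat.setInputDirRecursive(job, true) or the equivalent configuration property; property name from memory, worth verifying against your release:

```
mapreduce.input.fileinputformat.input.dir.recursive=true
```

)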

asked Apr 18 '12 by sa125
1 Answer

I didn't find any documentation on this, but */* works. So it's -input 'path/*/*'.
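A note on why this covers the tree in the question: a * in a glob does not cross directory separators, so path/*/* matches entries exactly two levels below path, i.e. files like one/two/bar.txt directly, plus directories like one/three/four, whose immediate files the input format then lists. The depth semantics can be illustrated with java.nio's glob PathMatcher (an illustration of glob matching only, not Hadoop's own globbing code):

```java
import java.nio.file.FileSystems;
import java.nio.file.Paths;

public class GlobDepthDemo {

    // "*/*" means "exactly one directory component, then one entry inside it";
    // "*" never matches across a "/" separator.
    static boolean matchesTwoDeep(String path) {
        return FileSystems.getDefault()
                .getPathMatcher("glob:*/*")
                .matches(Paths.get(path));
    }

    public static void main(String[] args) {
        System.out.println(matchesTwoDeep("two/bar.txt"));        // exactly two deep
        System.out.println(matchesTwoDeep("three/four/baz.txt")); // three deep: no match
        System.out.println(matchesTwoDeep("foo.txt"));            // one deep: no match
    }
}
```

So for trees deeper than the one shown, you would need additional patterns (path/*/*/* and so on), one per extra level of nesting.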

answered Sep 26 '22 by Cheng