I'm working on a job that processes a nested directory structure, containing files on multiple levels:
```
one/
├── three/
│   └── four/
│       ├── baz.txt
│       ├── bleh.txt
│       └── foo.txt
└── two/
    ├── bar.txt
    └── gaa.txt
```
When I add `one/` as an input path, no files are processed, since none are immediately available at the root level.

I read about `job.addInputPathRecursively(..)`, but this seems to have been deprecated in the more recent releases (I'm using Hadoop 1.0.2). I've written some code to walk the folders and add each directory with `job.addInputPath(dir)`, which worked until the job crashed when trying to process a directory as an input file, e.g. trying to `fs.open(split.getPath())` when `split.getPath()` is a directory (this happens inside `LineRecordReader.java`).
I'm trying to convince myself there has to be a simpler way to provide a job with a nested directory structure. Any ideas?
EDIT - apparently there's an open bug on this.
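One workaround is to walk the tree yourself and add only plain files, never directories, so `LineRecordReader` never sees a directory path. A minimal sketch of that walk follows; it uses `java.nio` so it is self-contained, but the Hadoop version has the same shape with `FileSystem.listStatus(..)`, `FileStatus.isDir()`, and `FileInputFormat.addInputPath(..)` in place of the `java.nio` calls:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class RecursiveInputs {
    // Recursively collect every regular file under root, skipping the
    // directories themselves. Each returned path can then be added as
    // an input path individually.
    static List<Path> collectFiles(Path root) throws IOException {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(root)) {
            for (Path entry : entries) {
                if (Files.isDirectory(entry)) {
                    files.addAll(collectFiles(entry)); // recurse into subdirs
                } else {
                    files.add(entry);                  // plain file: keep it
                }
            }
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        // Build a small tree like the one in the question.
        Path root = Files.createTempDirectory("one");
        Files.createDirectories(root.resolve("three/four"));
        Files.createDirectories(root.resolve("two"));
        Files.createFile(root.resolve("three/four/baz.txt"));
        Files.createFile(root.resolve("two/bar.txt"));
        System.out.println(collectFiles(root).size()); // prints 2
    }
}
```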
We use the `MultipleInputs` class, which supports MapReduce jobs that have multiple input paths, with a different `InputFormat` and `Mapper` for each path.
Yes, it is also possible to have the output of a Hadoop MapReduce job written to multiple directories, using the `MultipleOutputs` class.
Here, we are also trying to pass multiple files to a MapReduce job (files from multiple domains). For this we can simply edit the Java code and add a few lines for multiple inputs to work. The original snippet was cut off after `MultipleInputs.`; the completion below uses the standard `MultipleInputs.addInputPath(..)` call, and the `TextInputFormat` and mapper classes shown are illustrative placeholders:

```java
Path HdpPath = new Path(args[0]);
Path ClouderaPath = new Path(args[1]);
Path outputPath = new Path(args[2]);
// One mapper per input path; substitute your own mapper classes here.
MultipleInputs.addInputPath(job, HdpPath, TextInputFormat.class, HdpMapper.class);
MultipleInputs.addInputPath(job, ClouderaPath, TextInputFormat.class, ClouderaMapper.class);
FileOutputFormat.setOutputPath(job, outputPath);
```
MapReduce is a programming paradigm that enables massive scalability across hundreds or thousands of servers in a Hadoop cluster. As the processing component, MapReduce is the heart of Apache Hadoop. The term "MapReduce" refers to two separate and distinct tasks that Hadoop programs perform.
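The two tasks can be illustrated in miniature without a cluster. The sketch below is not Hadoop code, just plain Java streams showing the same two phases: a map step that splits lines into words, and a reduce step that groups identical words and sums their counts:

```java
import java.util.*;
import java.util.stream.*;

public class MapReduceSketch {
    // "Map" phase: split each line into words.
    // "Reduce" phase: group identical words and count them.
    static Map<String, Long> wordCounts(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts =
                wordCounts(Arrays.asList("foo bar", "bar baz bar"));
        System.out.println(counts.get("bar")); // prints 3
    }
}
```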
I didn't find any documentation on this, but `*/*` works. So it's `-input 'path/*/*'`.
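For intuition on why the glob helps: a `*` in a path glob does not cross a `/` separator, so `*/*` names entries exactly two levels below the input path. The quick check below uses `java.nio`'s glob matcher, whose `*` behaves the same way in this respect (this is an illustration of glob semantics, not of Hadoop's own matcher):

```java
import java.nio.file.*;

public class GlobDemo {
    // '*' matches within a single path segment only, so the pattern
    // "*/*" matches paths with exactly two segments.
    static boolean matchesTwoLevels(String path) {
        return FileSystems.getDefault().getPathMatcher("glob:*/*")
                .matches(Paths.get(path));
    }

    public static void main(String[] args) {
        System.out.println(matchesTwoLevels("two/bar.txt"));        // prints true
        System.out.println(matchesTwoLevels("three/four/baz.txt")); // prints false
    }
}
```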