
Hadoop MapReduce provide nested directories as job input


I'm working on a job that processes a nested directory structure, containing files on multiple levels:

one/
├── three/
│   └── four/
│       ├── baz.txt
│       ├── bleh.txt
│       └── foo.txt
└── two/
    ├── bar.txt
    └── gaa.txt

When I add one/ as an input path, no files are processed, since none are immediately available at the root level.

I read about job.addInputPathRecursively(..), but it seems to have been deprecated in more recent releases (I'm using Hadoop 1.0.2). I've written some code to walk the folders and add each directory with job.addInputPath(dir), which worked until the job crashed while trying to process a directory as an input file, e.g. calling fs.open(split.getPath()) when split.getPath() is a directory (this happens inside LineRecordReader.java).
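For reference, the walk logic can be sketched without any Hadoop dependencies. This is just the recursion shape (descend into subdirectories, collect only leaf files) written with java.nio.file standing in for Hadoop's FileSystem/FileStatus API; with the Hadoop classes you would use FileSystem.listStatus(path) and FileStatus.isDir() instead, and pass each collected file to job.addInputPath:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class RecursiveInputWalk {

    // Collect every regular file under root, descending into subdirectories.
    // Adding only these leaf files as inputs (rather than the directories
    // themselves) avoids handing a directory to the record reader.
    static List<Path> collectFiles(Path root) throws IOException {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(root)) {
            for (Path entry : entries) {
                if (Files.isDirectory(entry)) {
                    files.addAll(collectFiles(entry)); // recurse into subdir
                } else {
                    files.add(entry);                  // leaf file
                }
            }
        }
        return files;
    }
}
```

The crash described above happens when a directory itself ends up as an input split; collecting and adding only the leaf files sidesteps it, at the cost of one addInputPath call per file.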

I'm trying to convince myself there has to be a simpler way to provide a job with a nested directory structure. Any ideas?

EDIT - apparently there's an open bug on this.
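(For readers on later releases: the eventual resolution was a recursive flag on FileInputFormat. In Hadoop 2.x onward, so not available in 1.0.2, it can be enabled via FileInputFormat.setInputDirRecursive(job, true) or the equivalent configuration property; property name from memory, worth verifying against your release:

```
mapreduce.input.fileinputformat.input.dir.recursive=true
```

)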

asked Apr 18 '12 by sa125
1 Answer

I didn't find any documentation on this, but */* works. So it's -input 'path/*/*'.
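A note on why this covers the tree in the question: a * in a glob does not cross directory separators, so path/*/* matches entries exactly two levels below path, i.e. files like one/two/bar.txt directly, plus directories like one/three/four, whose immediate files the input format then lists. The depth semantics can be illustrated with java.nio's glob PathMatcher (an illustration of glob matching only, not Hadoop's own globbing code):

```java
import java.nio.file.FileSystems;
import java.nio.file.Paths;

public class GlobDepthDemo {

    // "*/*" means "exactly one directory component, then one entry inside it";
    // "*" never matches across a "/" separator.
    static boolean matchesTwoDeep(String path) {
        return FileSystems.getDefault()
                .getPathMatcher("glob:*/*")
                .matches(Paths.get(path));
    }

    public static void main(String[] args) {
        System.out.println(matchesTwoDeep("two/bar.txt"));        // exactly two deep
        System.out.println(matchesTwoDeep("three/four/baz.txt")); // three deep: no match
        System.out.println(matchesTwoDeep("foo.txt"));            // one deep: no match
    }
}
```

So for trees deeper than the one shown, you would need additional patterns (path/*/*/* and so on), one per extra level of nesting.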

answered Sep 26 '22 by Cheng