 

Hadoop : Provide directory as input to MapReduce job

I'm using Cloudera Hadoop. I'm able to run a simple MapReduce program where I provide a single file as input.

This file lists the names of all the other files to be processed by the mapper function.

But, I'm stuck at one point.

/folder1
  - file1.txt
  - file2.txt
  - file3.txt

How can I specify the input path to the MapReduce program as "/folder1", so that it processes each file inside that directory?

Any ideas?

EDIT :

1) Initially, I provided inputFile.txt as input to the MapReduce program. It worked perfectly.

>inputFile.txt
file1.txt
file2.txt
file3.txt

2) But now, instead of giving an input file, I want to provide an input directory as args[0] on the command line.

hadoop jar ABC.jar /folder1 /output
Saurabh Gokhale asked Nov 20 '13 11:11



1 Answer

The problem is that FileInputFormat does not read files recursively from the input directory by default.

Solution: enable recursive input in your MapReduce driver by calling

FileInputFormat.setInputDirRecursive(job, true);

before the line that adds the input path:

FileInputFormat.addInputPath(job, new Path(args[0]));

You can check here for which version it was fixed.
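
For context, here is a minimal sketch of how those two calls might sit in a complete driver. It assumes the new-API classes in org.apache.hadoop.mapreduce and a Hadoop release whose FileInputFormat provides setInputDirRecursive. The class name DirectoryInputDriver and the helper configurePaths are hypothetical, made up for illustration. No mapper or reducer is set, so the job falls back to Hadoop's default identity classes. Substitute your own.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver class; the name and layout are illustrative only.
public class DirectoryInputDriver {

    // Hypothetical helper: points the job at an input directory and an output directory.
    static void configurePaths(Job job, String inputDir, String outputDir) throws IOException {
        // Ask FileInputFormat to descend into nested subdirectories of inputDir.
        FileInputFormat.setInputDirRecursive(job, true);
        // The input path may be a directory; it does not have to be a single file.
        FileInputFormat.addInputPath(job, new Path(inputDir));
        FileOutputFormat.setOutputPath(job, new Path(outputDir));
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "directory input example");
        job.setJarByClass(DirectoryInputDriver.class);
        // Set your own classes here, e.g. job.setMapperClass(...) and job.setReducerClass(...).
        configurePaths(job, args[0], args[1]);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

With a driver along these lines packaged as the jar's main class, the command from the question (hadoop jar ABC.jar /folder1 /output) passes /folder1 as args[0] and /output as args[1].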

shashaDenovo answered Nov 03 '22 22:11