 

Pass directories not files to hadoop-streaming?

In my job, I have the need to parse many historical logsets. Individual customers (there are thousands) may have hundreds of log subdirectories broken out by date. For example:

  • logs/Customer_One/2011-01-02-001
  • logs/Customer_One/2012-02-03-001
  • logs/Customer_One/2012-02-03-002
  • logs/Customer_Two/2009-03-03-001
  • logs/Customer_Two/2009-03-03-002

Each individual log set may itself be five or six levels deep and contain thousands of files.

Therefore, I actually want the individual map jobs to handle walking the subdirectories: simply enumerating individual files is part of my distributed computing problem!

Unfortunately, when I pass a directory containing only log subdirectories to Hadoop, it complains that I can't pass those subdirectories to my mapper. (Again, my mapper is written to accept subdirectories as input):

$ hadoop jar "${HADOOP_HOME}/contrib/streaming/hadoop-streaming-${HADOOP_VERSION}.jar" -input file:///mnt/logs/Customer_Name/ -file mapper.sh -mapper "mapper.sh" -file reducer.sh -reducer "reducer.sh" -output .

[ . . . ]

12/04/10 12:48:35 ERROR security.UserGroupInformation: PriviledgedActionException as:cloudera (auth:SIMPLE) cause:java.io.IOException: Not a file: file:/mnt/logs/Customer_Name/2011-05-20-003
12/04/10 12:48:35 ERROR streaming.StreamJob: Error Launching job : Not a file: file:/mnt/logs/Customer_Name/2011-05-20-003
Streaming Command Failed!
[cloudera@localhost ~]$

Is there a straightforward way to convince Hadoop-streaming to permit me to assign directories as work items?

asked Apr 10 '12 by Jon Lasser


2 Answers

I guess you need to investigate writing a custom InputFormat to which you can pass the root directory. It would create a split for each customer, and then the record reader for each split would do the directory walk and push the file contents to your mappers.
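The answer doesn't spell out the InputFormat itself, but a minimal sketch of the idea might look like the following, using the older org.apache.hadoop.mapred API that streaming's -inputformat option expects. All class names here are hypothetical, and for brevity the record reader emits one (customer, file path) record per file rather than pushing the file contents themselves:

// Rough sketch only: one split per customer directory, with the recursive
// directory walk done inside the record reader on the map-task side.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class CustomerDirInputFormat extends FileInputFormat<Text, Text> {

  // One split per customer directory under the root input path.
  @Override
  public InputSplit[] getSplits(JobConf conf, int numSplits) throws IOException {
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (Path root : getInputPaths(conf)) {
      FileSystem fs = root.getFileSystem(conf);
      for (FileStatus customer : fs.listStatus(root)) {
        if (customer.isDir()) {
          splits.add(new FileSplit(customer.getPath(), 0, 0, new String[0]));
        }
      }
    }
    return splits.toArray(new InputSplit[splits.size()]);
  }

  @Override
  public RecordReader<Text, Text> getRecordReader(InputSplit split, JobConf conf,
      Reporter reporter) throws IOException {
    return new CustomerDirRecordReader(((FileSplit) split).getPath(), conf);
  }

  // Walks one customer directory recursively and emits (customer, file path)
  // pairs; the streaming mapper then opens each file itself.
  static class CustomerDirRecordReader implements RecordReader<Text, Text> {
    private final String customer;
    private final List<Path> files = new ArrayList<Path>();
    private int pos = 0;

    CustomerDirRecordReader(Path customerDir, JobConf conf) throws IOException {
      this.customer = customerDir.getName();
      collect(customerDir.getFileSystem(conf), customerDir);
    }

    private void collect(FileSystem fs, Path dir) throws IOException {
      for (FileStatus stat : fs.listStatus(dir)) {
        if (stat.isDir()) {
          collect(fs, stat.getPath());   // recurse into date subdirectories
        } else {
          files.add(stat.getPath());
        }
      }
    }

    public boolean next(Text key, Text value) {
      if (pos >= files.size()) {
        return false;
      }
      key.set(customer);
      value.set(files.get(pos++).toString());
      return true;
    }

    public Text createKey() { return new Text(); }
    public Text createValue() { return new Text(); }
    public long getPos() { return pos; }
    public float getProgress() {
      return files.isEmpty() ? 1.0f : (float) pos / files.size();
    }
    public void close() { }
  }
}

The compiled class could then be supplied to the streaming job with something like -inputformat CustomerDirInputFormat (shipping its jar via the -libjars generic option), so that each map task processes one customer directory and the directory walk happens inside its record reader rather than at job-submission time.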

answered Sep 28 '22 by Chris White


Hadoop allows glob patterns (shell-style wildcards) in input paths. I haven't experimented with complex patterns, but the simple placeholders ? and * do work.

So in your case, I think it will work if you use the following as your input path:

file:///mnt/logs/Customer_Name/*/*

The last asterisk might not be needed, as all files in the final directory should be added as input paths automatically.
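Plugged into the command from the question, that might look like the following (quoting the -input value so the pattern is expanded by Hadoop rather than by the local shell):

$ hadoop jar "${HADOOP_HOME}/contrib/streaming/hadoop-streaming-${HADOOP_VERSION}.jar" -input "file:///mnt/logs/Customer_Name/*/*" -file mapper.sh -mapper "mapper.sh" -file reducer.sh -reducer "reducer.sh" -output .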

answered Sep 28 '22 by Amar