In my job, I need to parse many historical log sets. Individual customers (there are thousands) may have hundreds of log subdirectories broken out by date, for example /mnt/logs/Customer_Name/2011-05-20-003.
Each individual log set may itself be five or six levels deep and contain thousands of files.
Therefore, I actually want the individual map jobs to handle walking the subdirectories: simply enumerating individual files is part of my distributed computing problem!
Unfortunately, when I try passing a directory containing only log subdirectories to Hadoop, it complains that it can't pass those subdirectories to my mapper. (Again, my mapper is written to accept subdirectories as input):
$ hadoop jar "${HADOOP_HOME}/contrib/streaming/hadoop-streaming-${HADOOP_VERSION}.jar" -input file:///mnt/logs/Customer_Name/ -file mapper.sh -mapper "mapper.sh" -file reducer.sh -reducer "reducer.sh" -output .
[ . . . ]
12/04/10 12:48:35 ERROR security.UserGroupInformation: PriviledgedActionException as:cloudera (auth:SIMPLE) cause:java.io.IOException: Not a file: file:/mnt/logs/Customer_Name/2011-05-20-003
12/04/10 12:48:35 ERROR streaming.StreamJob: Error Launching job : Not a file: file:/mnt/logs/Customer_Name/2011-05-20-003
Streaming Command Failed!
[cloudera@localhost ~]$
Is there a straightforward way to convince Hadoop-streaming to permit me to assign directories as work items?
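To make the intent concrete, here is a minimal sketch of the sort of mapper.sh I mean, assuming each input line handed to a map task is a subdirectory path (hypothetical; the actual log-parsing logic is omitted):

#!/bin/bash
# Hypothetical mapper.sh: each stdin line is a subdirectory path.
# The map task walks the subtree itself and emits one
# "filename<TAB>log line" record per line of every file it finds.
while read -r dir; do
  find "$dir" -type f | while read -r file; do
    # Real parsing would happen here; this just tags each line
    # with the file it came from.
    awk -v src="$file" '{ print src "\t" $0 }' "$file"
  done
done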
Let us first look at how Hadoop Streaming works. The mapper and the reducer (in the above example) are scripts that read input line by line from stdin and emit output to stdout. The streaming utility creates a MapReduce job, submits it to the cluster, and monitors its progress until completion.
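As an illustration of that stdin/stdout contract (not the poster's actual scripts), a trivial pair that counts how many times each distinct input line appears might look like this:

# mapper.sh (illustrative): read raw lines from stdin and write
# tab-separated "key<TAB>value" records to stdout.
while read -r line; do
  printf '%s\t1\n' "$line"
done

# reducer.sh (illustrative): stdin arrives sorted by key; sum the
# values for each key and print the totals.
awk -F'\t' '{ sum[$1] += $2 } END { for (k in sum) print k "\t" sum[k] }'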
I guess you need to investigate writing a custom InputFormat to which you can pass the root directory. It would create a split for each customer, and then the record reader for each split would do the directory walk and push the file contents to your mappers.
Hadoop supports glob patterns in input paths. I haven't experimented with many complex patterns, but the simple wildcards ? and * do work.
So in your case, I think it will work if you use the following as your input path:
file:///mnt/logs/Customer_Name/*/*
The last asterisk might not be needed, as all the files in the final directory are automatically added as input paths.
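For example, here is a sketch of the same streaming invocation from the question with a glob input path (the pattern is quoted so the local shell does not expand it before Hadoop sees it):

hadoop jar "${HADOOP_HOME}/contrib/streaming/hadoop-streaming-${HADOOP_VERSION}.jar" \
  -input 'file:///mnt/logs/Customer_Name/*/*' \
  -file mapper.sh -mapper "mapper.sh" \
  -file reducer.sh -reducer "reducer.sh" \
  -output .

Each * matches a single path component, so for the five- or six-level-deep log sets described in the question you would likely need additional /* components, or fall back to the custom InputFormat approach above, to reach every file.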