In my job, I need to parse many historical log sets. Individual customers (there are thousands) may have hundreds of log subdirectories broken out by date, for example /mnt/logs/Customer_Name/2011-05-20-003.
Each individual log set may itself be five or six levels deep and contain thousands of files.
Therefore, I actually want the individual map jobs to handle walking the subdirectories: simply enumerating individual files is part of my distributed computing problem!
Unfortunately, when I try passing a directory containing only log subdirectories to Hadoop, it complains that it can't pass those subdirectories to my mapper. (Again, my mapper is written to accept subdirectories as input):
$ hadoop jar "${HADOOP_HOME}/contrib/streaming/hadoop-streaming-${HADOOP_VERSION}.jar" -input file:///mnt/logs/Customer_Name/ -file mapper.sh -mapper "mapper.sh" -file reducer.sh -reducer "reducer.sh" -output .
[ . . . ]
12/04/10 12:48:35 ERROR security.UserGroupInformation: PriviledgedActionException as:cloudera (auth:SIMPLE) cause:java.io.IOException: Not a file: file:/mnt/logs/Customer_Name/2011-05-20-003
12/04/10 12:48:35 ERROR streaming.StreamJob: Error Launching job : Not a file: file:/mnt/logs/Customer_Name/2011-05-20-003
Streaming Command Failed!
[cloudera@localhost ~]$
Is there a straightforward way to convince Hadoop-streaming to permit me to assign directories as work items?
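To make the intent concrete, here is a minimal sketch of the sort of mapper.sh I mean, assuming each input line handed to a map task is a subdirectory path (hypothetical; the actual log-parsing logic is omitted):

#!/bin/bash
# Hypothetical mapper.sh: each stdin line is a subdirectory path.
# The map task walks the subtree itself and emits one
# "filename<TAB>log line" record per line of every file it finds.
while read -r dir; do
  find "$dir" -type f | while read -r file; do
    # Real parsing would happen here; this just tags each line
    # with the file it came from.
    awk -v src="$file" '{ print src "\t" $0 }' "$file"
  done
done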
Let us first look at how Hadoop Streaming works. The mapper and the reducer (in the above example) are scripts that read input line by line from stdin and emit output to stdout. The streaming utility creates a MapReduce job, submits it to the cluster, and monitors its progress until completion.
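As an illustration of that stdin/stdout contract (not the poster's actual scripts), a trivial pair that counts how many times each distinct input line appears might look like this:

# mapper.sh (illustrative): read raw lines from stdin and write
# tab-separated "key<TAB>value" records to stdout.
while read -r line; do
  printf '%s\t1\n' "$line"
done

# reducer.sh (illustrative): stdin arrives sorted by key; sum the
# values for each key and print the totals.
awk -F'\t' '{ sum[$1] += $2 } END { for (k in sum) print k "\t" sum[k] }'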
I guess you need to investigate writing a custom InputFormat to which you can pass the root directory. It would create a split for each customer, and then the record reader for each split would do the directory walk and push the file contents to your mappers.
Hadoop supports glob patterns in input paths. I haven't experimented with many complex patterns, but the simple wildcards ? and * do work.
So in your case, I think it will work if you use the following as your input path:
file:///mnt/logs/Customer_Name/*/*
The last asterisk might not be needed, as all the files in the final directory are automatically added as input paths.
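For example, here is a sketch of the same streaming invocation from the question with a glob input path (the pattern is quoted so the local shell does not expand it before Hadoop sees it):

hadoop jar "${HADOOP_HOME}/contrib/streaming/hadoop-streaming-${HADOOP_VERSION}.jar" \
  -input 'file:///mnt/logs/Customer_Name/*/*' \
  -file mapper.sh -mapper "mapper.sh" \
  -file reducer.sh -reducer "reducer.sh" \
  -output .

Each * matches a single path component, so for the five- or six-level-deep log sets described in the question you would likely need additional /* components, or fall back to the custom InputFormat approach above, to reach every file.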