Given a job with map and reduce phases, I can see that the output folder contains files named like "part-r-00000".
If I need to post-process these files at the application level, do I need to iterate over all the files in the output folder in natural naming order (part-r-00000, part-r-00001, part-r-00002, ...) to get the job results?
Or can I use some Hadoop helper file reader that would give me an "iterator" and handle the file switching for me (continuing with part-r-00001 once part-r-00000 has been completely read)?
You can use: hdfs dfs -text /books-result/part-r-00000 | head -n 20 and it will do the job.
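If you want to look at all the reducer outputs rather than just the first part file, note that the FS shell accepts globs, so (assuming the same /books-result path) something like this should stream every part file in name order:

hdfs dfs -text /books-result/part-r-* | head -n 20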
Hadoop's default reducer output format is TextOutputFormat, which writes (key, value) pairs on individual lines of text files. The keys and values can be of any type, since TextOutputFormat turns them into strings by calling toString() on them.
In a Hadoop MapReduce job, each reducer produces one output file named part-r-nnnnn, where nnnnn is the zero-based partition number of the reducer that produced it; the number of such files matches the number of reducers set for the job.
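For illustration, a minimal driver sketch (the job name is a placeholder) showing how the reducer count determines the number of part files:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

Job job = Job.getInstance(new Configuration(), "example");
job.setNumReduceTasks(3); // three reducers -> part-r-00000, part-r-00001, part-r-00002
job.setOutputFormatClass(TextOutputFormat.class); // the default: key<TAB>value per line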
All inputs and outputs are stored in HDFS. While the map is a mandatory step that filters and sorts the initial data, the reduce function is optional.
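As an aside, if you set the reducer count to zero the job becomes map-only and the output files are named part-m-nnnnn instead, one per map task:

job.setNumReduceTasks(0); // map-only: mapper output is written directly as part-m-nnnnn files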
You can use the getmerge command of the Hadoop FileSystem (FS) shell:
hadoop fs -getmerge /mapreduce/job/output/dir/ /your/local/output/file.txt
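getmerge concatenates the part files into a single local file, so you can post-process the whole job output as one stream. If I remember correctly there is also an optional -nl flag that appends a newline after each merged file.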
In MapReduce you specify an output folder; the only things it will contain are part-r files (each one the output of a reduce task) and an empty _SUCCESS file. So I think if you want to do post-processing you only need to set the output dir of job 1 as the input dir for job 2.
Your post-processor may have additional requirements that can be addressed; for example, is it important to process the output files in order?
If you just want to process the files locally, it all depends on the output format of your MapReduce job; that tells you how the part-r files are structured. Then you can simply use standard I/O.
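For the default TextOutputFormat, a minimal local post-processing sketch in plain Java (the file name is just an example) would split each line on the tab separator:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

try (BufferedReader reader = new BufferedReader(new FileReader("part-r-00000"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        String[] kv = line.split("\t", 2); // TextOutputFormat writes key<TAB>value
        System.out.println("key=" + kv[0] + ", value=" + (kv.length > 1 ? kv[1] : ""));
    }
} catch (IOException e) {
    e.printStackTrace();
}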
You can probably use the Hadoop FileSystem API to iterate over the part-r-xxxxx files from your application:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.IOUtils;

FileSystem fs = FileSystem.get(new Configuration());
FileStatus[] status = fs.listStatus(new Path("hdfs://hostname:port/joboutputpath"));
for (int i = 0; i < status.length; i++) {
    if (!status[i].getPath().getName().startsWith("part-")) continue; // skip _SUCCESS
    FSDataInputStream in = fs.open(status[i].getPath());
    IOUtils.copyBytes(in, System.out, 4096, false); // e.g. dump each part file
    in.close();
}
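One caveat if ordering matters, as in your question: listStatus is not guaranteed to return entries in sorted order on every FileSystem implementation. FileStatus is Comparable by path, so you can sort explicitly (java.util.Arrays) before the loop:

Arrays.sort(status); // natural name order: part-r-00000, part-r-00001, ...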
You can also look into ChainMapper/ChainReducer.
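ChainMapper/ChainReducer let you chain several mapper stages around a single reducer within one job ([MAP+ / REDUCE / MAP*]), so intermediate results never hit part files at all. A rough sketch, where TokenizerMapper, SumReducer, and UppercaseMapper are hypothetical classes you would write yourself:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;

Job job = Job.getInstance(new Configuration(), "chain");
// first map stage
ChainMapper.addMapper(job, TokenizerMapper.class,
        LongWritable.class, Text.class, Text.class, IntWritable.class,
        new Configuration(false));
// the single reduce stage
ChainReducer.setReducer(job, SumReducer.class,
        Text.class, IntWritable.class, Text.class, IntWritable.class,
        new Configuration(false));
// an extra map stage applied after the reduce
ChainReducer.addMapper(job, UppercaseMapper.class,
        Text.class, IntWritable.class, Text.class, IntWritable.class,
        new Configuration(false));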