Given a job with map and reduce phases, I can see that the output folder contains files named like "part-r-00000".
If I need to post-process these files at the application level, do I need to iterate over all the files in the output folder in natural naming order (part-r-00000, part-r-00001, part-r-00002, ...) to get the job results?
Or can I use some Hadoop helper file reader that would give me an "iterator" and handle the file switching for me (continuing with part-r-00001 once part-r-00000 has been completely read)?
You can use: hdfs dfs -text /books-result/part-r-00000 | head -n 20 and it will do the job.
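If you want to look at all the reducer outputs rather than just the first part file, note that the FS shell accepts globs, so (assuming the same /books-result path) something like this should stream every part file in name order:

hdfs dfs -text /books-result/part-r-* | head -n 20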
Hadoop's default reducer output format is TextOutputFormat, which writes (key, value) pairs on individual lines of text files. The keys and values can be of any type, since TextOutputFormat turns them into strings by calling toString() on them.
In a Hadoop MapReduce job, each reducer produces one output file named part-r-nnnnn, where nnnnn is the zero-based partition number of the reducer that produced it; the number of such files matches the number of reducers set for the job.
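For illustration, a minimal driver sketch (the job name is a placeholder) showing how the reducer count determines the number of part files:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

Job job = Job.getInstance(new Configuration(), "example");
job.setNumReduceTasks(3); // three reducers -> part-r-00000, part-r-00001, part-r-00002
job.setOutputFormatClass(TextOutputFormat.class); // the default: key<TAB>value per line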
All inputs and outputs are stored in HDFS. While the map is a mandatory step that filters and sorts the initial data, the reduce function is optional.
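As an aside, if you set the reducer count to zero the job becomes map-only and the output files are named part-m-nnnnn instead, one per map task:

job.setNumReduceTasks(0); // map-only: mapper output is written directly as part-m-nnnnn files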
You can use the getmerge command of the Hadoop FileSystem (FS) shell:
hadoop fs -getmerge /mapreduce/job/output/dir/ /your/local/output/file.txt
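getmerge concatenates the part files into a single local file, so you can post-process the whole job output as one stream. If I remember correctly there is also an optional -nl flag that appends a newline after each merged file.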
In MapReduce you specify an output folder; the only things it will contain are part-r files (each one the output of a reduce task) and an empty _SUCCESS file. So I think if you want to do post-processing you only need to set the output dir of job 1 as the input dir for job 2.
Your post-processor may have additional requirements that can be addressed; for example, is it important to process the output files in order?
If you just want to process the files locally, it all depends on the output format of your MapReduce job; that tells you how the part-r files are structured. Then you can simply use standard I/O.
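For the default TextOutputFormat, a minimal local post-processing sketch in plain Java (the file name is just an example) would split each line on the tab separator:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

try (BufferedReader reader = new BufferedReader(new FileReader("part-r-00000"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        String[] kv = line.split("\t", 2); // TextOutputFormat writes key<TAB>value
        System.out.println("key=" + kv[0] + ", value=" + (kv.length > 1 ? kv[1] : ""));
    }
} catch (IOException e) {
    e.printStackTrace();
}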
You can probably use the Hadoop FileSystem API to iterate over the part-r-xxxxx files from your application:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.IOUtils;

FileSystem fs = FileSystem.get(new Configuration());
FileStatus[] status = fs.listStatus(new Path("hdfs://hostname:port/joboutputpath"));
for (int i = 0; i < status.length; i++) {
    if (!status[i].getPath().getName().startsWith("part-")) continue; // skip _SUCCESS
    FSDataInputStream in = fs.open(status[i].getPath());
    IOUtils.copyBytes(in, System.out, 4096, false); // e.g. dump each part file
    in.close();
}
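One caveat if ordering matters, as in your question: listStatus is not guaranteed to return entries in sorted order on every FileSystem implementation. FileStatus is Comparable by path, so you can sort explicitly (java.util.Arrays) before the loop:

Arrays.sort(status); // natural name order: part-r-00000, part-r-00001, ...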
You can also look into ChainMapper/ChainReducer.
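ChainMapper/ChainReducer let you chain several mapper stages around a single reducer within one job ([MAP+ / REDUCE / MAP*]), so intermediate results never hit part files at all. A rough sketch, where TokenizerMapper, SumReducer, and UppercaseMapper are hypothetical classes you would write yourself:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;

Job job = Job.getInstance(new Configuration(), "chain");
// first map stage
ChainMapper.addMapper(job, TokenizerMapper.class,
        LongWritable.class, Text.class, Text.class, IntWritable.class,
        new Configuration(false));
// the single reduce stage
ChainReducer.setReducer(job, SumReducer.class,
        Text.class, IntWritable.class, Text.class, IntWritable.class,
        new Configuration(false));
// an extra map stage applied after the reduce
ChainReducer.addMapper(job, UppercaseMapper.class,
        Text.class, IntWritable.class, Text.class, IntWritable.class,
        new Configuration(false));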