
Hadoop - get results from output files after reduce?

Given a job with map and reduce phases, I can see that the output folder contains files named like "part-r-00000".

If I need to post-process these files at the application level, do I need to iterate over all files in the output folder in natural naming order (part-r-00000, part-r-00001, part-r-00002 ...) in order to get the job results?

Or can I use some Hadoop helper file reader that will give me an "iterator" and handle the file switching for me (when part-r-00000 is completely read, continue with part-r-00001)?

Asked by jdevelop on Aug 26 '13.


People also ask

How can I see Hadoop output?

You can use hdfs dfs -text /books-result/part-r-00000 | head -n 20, which prints the first 20 lines of that output file.

In what form is reducer output presented?

The default Hadoop reducer output format is TextOutputFormat, which writes (key, value) pairs on individual lines of text files. The keys and values can be of any type, since TextOutputFormat converts them to strings by calling toString() on them.
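For illustration, a minimal sketch of configuring this default text output in a job driver (the key/value classes are just example choices; the tab separator can be changed via the mapreduce.output.textoutputformat.separator property):

// requires org.apache.hadoop.mapreduce.lib.output.TextOutputFormat and org.apache.hadoop.io.*
job.setOutputFormatClass(TextOutputFormat.class);
job.setOutputKeyClass(Text.class);          // example key type
job.setOutputValueClass(IntWritable.class); // example value type
// each reducer then writes lines of the form "key<TAB>value" into its part-r-xxxxx file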

How many output files are created when a MapReduce job is run?

In a Hadoop MapReduce job, each reducer produces one output file named part-r-nnnnn, where nnnnn is the sequence number of the file; the number of such files therefore matches the number of reducers set for the job.
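For example, the reducer count (and hence the number of part-r files) is set on the job object; a minimal sketch, assuming the new mapreduce API:

job.setNumReduceTasks(4);   // produces part-r-00000 .. part-r-00003 in the output dir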

Where is MapReduce output stored?

All inputs and outputs are stored in HDFS. While the map is a mandatory step to filter and sort the initial data, the reduce function is optional.


3 Answers

You can use the getmerge command of the Hadoop File System (FS) shell:

hadoop fs -getmerge /mapreduce/job/output/dir/ /your/local/output/file.txt
Answered by GS Majumder.


In MapReduce you specify an output folder; the only things it will contain are the part-r files (each one is the output of a reduce task) and an empty _SUCCESS file. So if you want to do post-processing, you only need to set the output dir of job 1 as the input dir for job 2.
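A minimal sketch of that chaining, assuming the new mapreduce API and hypothetical path names:

// requires org.apache.hadoop.mapreduce.lib.input.FileInputFormat and ...lib.output.FileOutputFormat
Path intermediate = new Path("/mapreduce/job1/output");    // hypothetical path
FileOutputFormat.setOutputPath(job1, intermediate);        // job 1 writes its part-r files here
FileInputFormat.addInputPath(job2, intermediate);          // job 2 reads them as its input
FileOutputFormat.setOutputPath(job2, new Path("/mapreduce/job2/output"));
job1.waitForCompletion(true);                              // run job 1 first
job2.waitForCompletion(true);                              // then job 2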

There might be additional requirements for your post-processor that can be addressed; for example, is it important to process the output files in order?

Or, if you just want to process the files locally, it all depends on the output format of your MapReduce job; that tells you how the part-r files are structured. Then you can simply use standard I/O, I guess.
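For illustration, a sketch assuming the default TextOutputFormat (tab-separated key/value lines) and that a part-r file has been copied to the local machine (the file name here is just a placeholder):

// requires java.io.BufferedReader and java.io.FileReader
try (BufferedReader reader = new BufferedReader(new FileReader("part-r-00000"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        String[] kv = line.split("\t", 2);   // kv[0] = key, kv[1] = value (as strings)
        // ... post-process the pair ...
    }
}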

Answered by DDW.


You can probably use the Hadoop FileSystem API to iterate over the part-r-xxxxx files from your application.

// requires org.apache.hadoop.conf.Configuration and org.apache.hadoop.fs.*
FileSystem fs = FileSystem.get(new Configuration());
FileStatus[] status = fs.listStatus(new Path("hdfs://hostname:port/joboutputpath"));
for (int i = 0; i < status.length; i++) {
    FSDataInputStream in = fs.open(status[i].getPath());   // open each part-r-xxxxx file
    // ... read from the stream ...
    in.close();
}

You can also look into ChainMapper/ChainReducer.
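If you go the ChainMapper/ChainReducer route, a rough sketch with the new mapreduce API might look like this (the mapper/reducer class names are hypothetical):

// requires org.apache.hadoop.mapreduce.lib.chain.ChainMapper and ChainReducer
Job job = Job.getInstance(new Configuration(), "chained job");
ChainMapper.addMapper(job, FirstMapper.class, LongWritable.class, Text.class,
        Text.class, IntWritable.class, new Configuration(false));
ChainMapper.addMapper(job, SecondMapper.class, Text.class, IntWritable.class,
        Text.class, IntWritable.class, new Configuration(false));
ChainReducer.setReducer(job, SumReducer.class, Text.class, IntWritable.class,
        Text.class, IntWritable.class, new Configuration(false));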

Answered by SSaikia_JtheRocker.