Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

merge output files after reduce phase

In mapreduce each reduce task write its output to a file named part-r-nnnnn where nnnnn is a partition ID associated with the reduce task. Does map/reduce merge these files? If yes, how?

like image 297
Shahryar Avatar asked Apr 18 '11 08:04

Shahryar


People also ask

Where does the output of a reducer get stored?

In Hadoop, Reducer takes the output of the Mapper (intermediate key-value pair) process each of them to generate the output. The output of the reducer is the final output, which is stored in HDFS. Usually, in the Hadoop Reducer, we do aggregation or summation sort of computation.

What happens in reducer phase?

Reducer is a phase in hadoop which comes after Mapper phase. The output of the mapper is given as the input for Reducer which processes and produces a new set of output, which will be stored in the HDFS.

How do I merge files in Hadoop?

Hadoop -getmerge command is used to merge multiple files in an HDFS(Hadoop Distributed File System) and then put it into one single output file in our local file system. We want to merge the 2 files present inside are HDFS i.e. file1. txt and file2. txt, into a single file output.

In what form is reducer output presented compressed?

i. MapReduce default Hadoop reducer Output Format is TextOutputFormat, which writes (key, value) pairs on individual lines of text files and its keys and values can be of any type since TextOutputFormat turns them to string by calling toString() on them.


1 Answers

Instead of doing the file merging on your own, you can delegate the entire merging of the reduce output files by calling:

hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt 

Note This combines the HDFS files locally. Make sure you have enough disk space before running

like image 77
diliop Avatar answered Sep 20 '22 08:09

diliop