Is the output of the map phase of a MapReduce job always sorted?

Tags: hadoop, hadoop2, mapreduce

I am a bit confused by the output I get from the Mapper.

For example, when I run a simple wordcount program, with this input text:

hello world
Hadoop programming
mapreduce wordcount
lets see if this works
12345678
hello world
mapreduce wordcount

This is the output that I get:

12345678    1
Hadoop  1
hello   1
hello   1
if  1
lets    1
mapreduce   1
mapreduce   1
programming 1
see 1
this    1
wordcount   1
wordcount   1
works   1
world   1
world   1

As you can see, the output from the mapper is already sorted, even though I did not run a Reducer at all. But in a different project I find that the output from the mapper is not sorted. So I am not totally clear about this.

My questions are:

Is the mapper's output always sorted?
Is the sort phase integrated into the mapper phase already, so that the output of map phase is already sorted in the intermediate data?
Is there a way to collect the data from the sort and shuffle phase and persist it before it goes to the Reducer? A reducer is presented with a key and an iterable of values. Is there a way I could persist this data?

asked Jul 16 '14 01:07

brain storm

1 Answer

Is the mapper's output always sorted?

No. It is not sorted if you use no reducer. If you use a reducer, the mapper's output is pre-sorted before it is written to disk, and the sorting is completed in the Reduce phase. What is probably happening here (just a guess) is that you are not specifying a Reducer class, which, in the new API, results in the Identity Reducer being used (see this answer and comment). The Identity Reducer just outputs its input. To verify that, check the job's default reducer counters (there should be some reduce tasks, reduce input records and groups, reduce output records, ...).
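To make the difference concrete, here is a minimal plain-Java simulation of the two cases. It does not use Hadoop at all: the class and method names (`MapOutputOrderSketch`, `map`, `sortThenIdentityReduce`) are made up for illustration, and `String.compareTo` is assumed as a stand-in for the key comparator. It is only a sketch of the observable ordering, not of how the framework is implemented.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation; none of these names are Hadoop APIs.
class MapOutputOrderSketch {

    // Stand-in for the wordcount mapper: emits (word, 1) pairs in input order.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    out.add(new SimpleImmutableEntry<>(word, 1));
                }
            }
        }
        return out;
    }

    // Stand-in for sort/shuffle followed by an identity reducer:
    // pairs are grouped and ordered by key, then written out unchanged.
    static List<Map.Entry<String, Integer>> sortThenIdentityReduce(
            List<Map.Entry<String, Integer>> mapOutput) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            for (Integer value : group.getValue()) {
                out.add(new SimpleImmutableEntry<>(group.getKey(), value));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "hello world", "Hadoop programming", "mapreduce wordcount",
                "lets see if this works", "12345678",
                "hello world", "mapreduce wordcount");
        List<Map.Entry<String, Integer>> mapOnly = map(input);
        System.out.println("zero reducers:    " + mapOnly);
        System.out.println("identity reducer: " + sortThenIdentityReduce(mapOnly));
    }
}
```

With the input from the question, the second list comes out in the same key order as the output shown in the question, while the first keeps the order in which the words were read.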

Is the sort phase integrated into the mapper phase already, so that the output of map phase is already sorted in the intermediate data?

As I explained in my answer to the previous question, if you use no reducers, the mapper does not sort the data. If you do use reducers, the data start getting sorted in the map phase and are then merge-sorted in the reduce phase.
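The "sorted on the map side, merged on the reduce side" idea can be sketched in plain Java as follows. This is a simplified illustration and not Hadoop code: the names `MergeSortedRunsSketch`, `sortLocally` and `merge` are hypothetical, each list stands in for the output of one map task, and partitioning, spilling and combiners are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Plain-Java illustration; none of these names are Hadoop APIs.
class MergeSortedRunsSketch {

    // Map side: each simulated map task sorts only its own keys.
    static List<String> sortLocally(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        Collections.sort(sorted);
        return sorted;
    }

    // Reduce side: k-way merge of runs that are already sorted.
    static List<String> merge(List<List<String>> runs) {
        // Each queue element is {runIndex, positionInRun}, ordered by the key it points at.
        PriorityQueue<int[]> heads = new PriorityQueue<>(
                (a, b) -> runs.get(a[0]).get(a[1]).compareTo(runs.get(b[0]).get(b[1])));
        for (int i = 0; i < runs.size(); i++) {
            if (!runs.get(i).isEmpty()) {
                heads.add(new int[] {i, 0});
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heads.isEmpty()) {
            int[] head = heads.poll();
            List<String> run = runs.get(head[0]);
            merged.add(run.get(head[1]));
            if (head[1] + 1 < run.size()) {
                heads.add(new int[] {head[0], head[1] + 1});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> task1 = sortLocally(Arrays.asList("hello", "world", "Hadoop", "programming"));
        List<String> task2 = sortLocally(Arrays.asList("mapreduce", "wordcount", "hello"));
        System.out.println(merge(Arrays.asList(task1, task2)));
    }
}
```

The merge never re-sorts a whole run; it only compares the current head of each run, which is why it pays off to have the map side deliver its output already sorted.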

Is there a way to collect the data from the sort and shuffle phase and persist it before it goes to the Reducer? A reducer is presented with a key and an iterable of values. Is there a way I could persist this data?

Again, shuffling and sorting are parts of the Reduce phase. An Identity Reducer will do what you want. If you want to output one key-value pair per reduce call (that is, per key), with the value being a concatenation of the iterable's values, just accumulate the values in memory (e.g. in a StringBuffer) and then output this concatenation as the value.

If you want the map output to go straight to the program's output, without going through a reduce phase, then set the number of reduce tasks to zero in the driver class, like this:

job.setNumReduceTasks(0);

This will not get your output sorted, though. It will skip the pre-sorting process of the mapper and write the output directly to HDFS.
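For the concatenation approach mentioned above, the core of such a reduce call could look roughly like the following plain-Java sketch. It deliberately avoids Hadoop types: `ConcatValuesSketch` and `concat` are hypothetical names, the values are assumed to be plain strings, the separator is an arbitrary choice, and a `StringBuilder` is used in place of the `StringBuffer` mentioned above since no synchronization is needed here.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java stand-in for the body of a reduce call; not a Hadoop Reducer.
class ConcatValuesSketch {

    // Joins all values that belong to one key into a single string.
    // Note: in a real Hadoop reducer the value object is reused between
    // iterations, so copy its contents (e.g. via toString()) as done here
    // by appending, rather than keeping references to the objects.
    static String concat(Iterable<String> values, String separator) {
        StringBuilder joined = new StringBuilder();
        boolean first = true;
        for (String value : values) {
            if (!first) {
                joined.append(separator);
            }
            joined.append(value);
            first = false;
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        // One simulated reduce call: key "hello" with the values the mappers emitted for it.
        List<String> valuesForHello = Arrays.asList("1", "1");
        System.out.println("hello\t" + concat(valuesForHello, ","));
    }
}
```

Keep in mind that this holds all values of a key in memory at once, so it only suits keys whose value lists are reasonably small.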

answered Sep 20 '22 12:09

vefthym
