Hadoop one Map and multiple Reduce

Tags:

mapreduce

We have a large dataset to analyze with multiple reduce functions.

All of the reduce algorithms work on the same dataset, generated by the same map function. Reading the large dataset is too expensive to repeat for every reduce function; it would be better to read it only once and pass the mapped data to multiple reduce functions.

Can I do this with Hadoop? I've searched the examples and the intarweb but I could not find any solutions.

asked Feb 25 '10 11:02

KARASZI István

2 Answers

A simple solution might be to write a job that has no reduce function, so that all the mapped data is passed directly to the output of the job. To do this, you just set the number of reducers to zero for the job.

Then you would write a separate job for each reduce function, each one reading that mapped data as its input. This does mean storing all the mapped data on HDFS, though.
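To make the data flow concrete, here is a minimal sketch in plain Python that simulates the two-stage idea: one map-only pass whose output is stored, followed by one reduce pass per function. It is not Hadoop code. The names `parse_line`, `map_only_job`, and `reduce_job` and the sample records are made up for illustration, and the in-memory list `stored` stands in for the files the map-only job would write to HDFS. In Hadoop's Java API, the map-only part corresponds to calling `setNumReduceTasks(0)` on the job.

```python
from collections import defaultdict


def parse_line(line):
    """Hypothetical map function: turn 'key value' into one (key, int) pair."""
    key, value = line.split()
    return [(key, int(value))]


def map_only_job(records, map_fn):
    """Stage 1: run the map function once and keep its raw output (no reducers)."""
    mapped = []
    for record in records:
        mapped.extend(map_fn(record))
    return mapped


def reduce_job(mapped, reduce_fn):
    """Stage 2: group the stored pairs by key and apply a single reduce function."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return {key: reduce_fn(values) for key, values in groups.items()}


raw = ["a 3", "b 5", "a 7"]            # the expensive input, read only once
stored = map_only_job(raw, parse_line)  # stands in for the data saved on HDFS
totals = reduce_job(stored, sum)        # first reduce job
peaks = reduce_job(stored, max)         # second reduce job, same stored data
```

The trade-off is the one mentioned above: the raw input is read once, but every reduce job still reads the stored intermediate data.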

Another alternative might be to combine all your reduce functions into a single Reducer that writes to multiple files, using a separate output for each function. Multiple outputs are mentioned in this article for Hadoop 0.19. I'm pretty sure that this feature is broken in the new mapreduce API released with 0.20.1, but you can still use it in the older mapred API.
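As a rough illustration of the combined-reducer idea, the plain-Python sketch below groups the mapped pairs once and applies every reduce function to each group, collecting each function's results under its own name. It is a simulation, not Hadoop code: `multi_reduce`, the output names `"sum"` and `"max"`, and the sample pairs are assumptions for the example, and the per-name dictionaries only stand in for the separate output files that a multiple-outputs mechanism would write.

```python
from collections import defaultdict


def multi_reduce(mapped, reduce_fns):
    """Apply several reduce functions to the same grouped data.

    mapped     -- iterable of (key, value) pairs, as emitted by the single map step
    reduce_fns -- dict mapping an output name to a function over a list of values
    Returns one result dict per output name.
    """
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    outputs = {name: {} for name in reduce_fns}
    for key, values in groups.items():
        # Each function sees the same values for this key; its result goes
        # to its own named output.
        for name, fn in reduce_fns.items():
            outputs[name][key] = fn(values)
    return outputs


mapped_pairs = [("a", 3), ("b", 5), ("a", 7)]
results = multi_reduce(mapped_pairs, {"sum": sum, "max": max})
```

One thing to keep in mind when porting this: the sketch buffers each key's values in a list so that every function can read them. A real reducer would probably need to do the same buffering, or compute all the results in a single pass over the values.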

answered Oct 12 '22 04:10

Binary Nerd

Are you expecting every reducer to work on exactly the same mapped data? If so, at least the "key" has to be different, since the key decides which reducer a record goes to.

You can write each output multiple times in the mapper, emitting a composite key built from $i and $key (where $i identifies the i-th reducer, and $key is your original key). You then need to add a "Partitioner" to make sure these n records are distributed across the reducers based on $i, and use a "GroupingComparator" to group the records by the original $key.

It's possible to do that, but not in a trivial way in one MR job.
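The composite-key scheme from this answer can be simulated in plain Python to show where each piece fits. This is only a sketch under assumptions, not Hadoop code: the composite key is modelled as a tuple `(i, key)`, the function names `replicate`, `partition`, and `run_single_job` are invented for the example, and there is exactly one simulated reducer per reduce function.

```python
from itertools import groupby


def replicate(mapped, n):
    """Mapper side: emit every pair n times, tagged with the target reducer index i."""
    return [((i, key), value) for key, value in mapped for i in range(n)]


def partition(composite_key, num_reducers):
    """Plays the role of the Partitioner: route a record by $i only, ignoring $key."""
    i, _original_key = composite_key
    return i % num_reducers


def run_single_job(mapped, reduce_fns):
    """Simulate one job in which the i-th reducer applies reduce_fns[i]."""
    n = len(reduce_fns)
    buckets = [[] for _ in range(n)]
    for composite_key, value in replicate(mapped, n):
        buckets[partition(composite_key, n)].append((composite_key, value))

    results = []
    for i, bucket in enumerate(buckets):
        # Shuffle/sort step: order the records by composite key.
        bucket.sort(key=lambda kv: kv[0])
        out = {}
        # Plays the role of the GroupingComparator: group by the original key.
        for key, group in groupby(bucket, key=lambda kv: kv[0][1]):
            out[key] = reduce_fns[i]([value for _, value in group])
        results.append(out)
    return results


job_output = run_single_job([("a", 3), ("b", 5), ("a", 7)], [sum, max])
```

The cost is visible in `replicate`: the mapped data is written and shuffled n times, once per reduce function, even though the input is read only once.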

answered Oct 12 '22 06:10

Victor
