We have a large dataset to analyze with multiple reduce functions.
All reduce algorithm work on the same dataset generated by the same map function. Reading the large dataset costs too much to do it every time, it would be better to read only once and pass the mapped data to multiple reduce functions.
Can I do this with Hadoop? I've searched the examples and the intarweb but I could not find any solutions.
MapReduce assigns fragments of data across the nodes in a Hadoop cluster. The goal is to split a dataset into chunks and use an algorithm to process those chunks at the same time. The parallel processing on multiple machines greatly increases the speed of handling even petabytes of data.
If there are lot of key-values to merge, a single reducer might take too much time. To avoid reducer machine becoming the bottleneck, we use multiple reducers. When you have multiple reducers, each node that is running mapper puts key-values in multiple buckets just after sorting.
setOutputPath(job, outputPath); Adding following lines to the code will yield multiple files to be passed within a single map reduce job.
MapReduce is a programming paradigm that enables massive scalability across hundreds or thousands of servers in a Hadoop cluster. As the processing component, MapReduce is the heart of Apache Hadoop. The term "MapReduce" refers to two separate and distinct tasks that Hadoop programs perform.
Maybe a simple solution would be to write a job that doesn't have a reduce function. So you would pass all the mapped data directly to the output of the job. You just set the number of reducers to zero for the job.
Then you would write a job for each different reduce function that works on that data. This would mean storing all the mapped data on the HDFS though.
Another alternative might be to combine all your reduce functions into a single Reducer which outputs to multiple files, using a different output for each different function. Multiple outputs are mentioned in this article for hadoop 0.19. I'm pretty sure that this feature is broken in the new mapreduce API released with 0.20.1, but you can still use it in the older mapred API.
Are you expecting every reducer to work on exactly same mapped data? But at least the "key" should be different since it decides which reducer to go.
You can write an output for multiple times in mapper, and output as key (where $i is for the i-th reducer, and $key is your original key). And you need to add a "Partitioner" to make sure these n records are distributed in reducers, based on $i. Then using "GroupingComparator" to group records by original $key.
It's possible to do that, but not in trivial way in one MR.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With