Sorting large data using MapReduce/Hadoop

I am reading about MapReduce and the following thing is confusing me.

Suppose we have a file with 1 million entries (integers) and we want to sort them using MapReduce. The way I understood to go about it is as follows:

Write a mapper function that sorts integers. The framework will divide the input file into multiple chunks and give them to different mappers. Each mapper will sort its chunk of data independently of the others. Once all the mappers are done, we pass each of their results to the reducer, which combines them and produces the final output.
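In a single process, the flow I described might look like this (the `mapper`, `reducer`, and chunking here are plain Python stand-ins for illustration, not Hadoop's actual API):

```python
import heapq

def mapper(chunk):
    # Each mapper sorts its own chunk independently.
    return sorted(chunk)

def reducer(sorted_chunks):
    # The reducer merges the already-sorted chunks into one output.
    return list(heapq.merge(*sorted_chunks))

data = [5, 3, 9, 1, 7, 2, 8, 6]
chunks = [data[0:4], data[4:8]]   # the framework splits the input
result = reducer([mapper(c) for c in chunks])
print(result)                     # [1, 2, 3, 5, 6, 7, 8, 9]
```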

My doubt is: if we have one reducer, how does this leverage the distributed framework when, eventually, we have to combine the results in one place? The problem boils down to merging 1 million entries in one place. Is that so, or am I missing something?

Thanks, Chander

asked Sep 02 '10 by Chander Shivdasani




1 Answer

Check out merge-sort.

It turns out that merging already-sorted lists is much more efficient, in both operations and memory consumption, than sorting the complete list from scratch.

If the reducer gets 4 sorted lists, it only needs to look at the smallest remaining element of each of the 4 lists and pick that one. If the number of lists is constant, this reduce step is an O(N) operation.
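A minimal sketch of that merge step, using a heap to track the head of each list (for k lists each pop/push costs O(log k), so the whole merge is O(N log k), i.e. effectively O(N) when k is a fixed constant):

```python
import heapq

def kway_merge(lists):
    # Seed the heap with the head of each non-empty list:
    # (value, list index, position within that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    out = []
    while heap:
        value, i, j = heapq.heappop(heap)   # smallest head among all lists
        out.append(value)
        if j + 1 < len(lists[i]):
            # Replace it with the next element from the same list.
            heapq.heappush(heap, (lists[i][j + 1], i, j + 1))
    return out

print(kway_merge([[1, 4, 7], [2, 5, 8], [3, 6, 9], [0, 10]]))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

(In practice Python's `heapq.merge` does exactly this; the explicit version just shows where the O(log k) per element comes from.)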

Also, typically the reducers themselves are "distributed", e.g. arranged in a tree, so the merging work can be parallelized too.
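The tree arrangement can be sketched as pairwise merging, level by level; every merge within a level is independent and could run on a different reducer (this is an illustration of the idea, not a built-in Hadoop feature):

```python
import heapq

def tree_merge(runs):
    # Merge sorted runs two at a time; each level halves the number
    # of runs, and all merges in a level are independent work.
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + 2]))
                for i in range(0, len(runs), 2)]
    return runs[0] if runs else []

print(tree_merge([[3, 9], [1, 7], [2, 8], [5, 6]]))
# [1, 2, 3, 5, 6, 7, 8, 9]
```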

answered Sep 23 '22 by Peter Tillemans