 

What is the use of a grouping comparator in Hadoop MapReduce?


I would like to know why a grouping comparator is used in the secondary sort of MapReduce.

According to the Definitive Guide's example of secondary sorting:

We want the sort order for keys to be by year (ascending) and then by temperature (descending):

    1900 35°C
    1900 34°C
    1900 34°C
    ...
    1901 36°C
    1901 35°C

By setting a partitioner to partition by the year part of the key, we can guarantee that records for the same year go to the same reducer. This still isn’t enough to achieve our goal, however. A partitioner ensures only that one reducer receives all the records for a year; it doesn’t change the fact that the reducer groups by key within the partition.
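The quoted point can be seen in a small simulation. This is a plain-Python sketch, not Hadoop code: the sample records are made up, and `itertools.groupby` merely stands in for the reducer's default behaviour of grouping consecutive, fully equal keys.

```python
from itertools import groupby

# Hypothetical composite keys (year, temperature) that a single reducer
# receives after partitioning by year.
records = [(1900, 34), (1901, 35), (1900, 35), (1901, 36), (1900, 34)]

# Sort order from the quote: year ascending, then temperature descending.
sorted_keys = sorted(records, key=lambda k: (k[0], -k[1]))

# Default grouping: consecutive keys share a group only if the WHOLE key
# is equal, so each distinct (year, temperature) pair is its own group.
groups_by_full_key = [(key, len(list(group))) for key, group in groupby(sorted_keys)]

for key, count in groups_by_full_key:
    print(key, count)
```

Even though all records for 1900 sit on the same reducer, they are still split across several groups here, which is the gap the quoted passage describes.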

Since we have already written our own partitioner, which takes care of sending the map output keys to a particular reducer, why do we still need to group them?

Thanks in advance

Pramod asked Feb 06 '13 11:02





1 Answer

In support of the chosen answer I add:

Following on from this explanation

**Input**:

    symbol time price
    a      1    10
    a      2    20
    b      3    30

**Map output**: create composite key\values like so:

> symbol-time time-price
>
>**a-1**         1-10
>
>**a-2**         2-20
>
>**b-3**         3-30

The Partitioner: will route the a-1 and a-2 keys to the same reducer despite the keys being different, because it partitions on the symbol alone. The b-3 key is routed by its own symbol and may end up on a different reducer.
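A partitioner of this kind can be sketched as follows. This is an illustrative Python stand-in rather than Hadoop's `Partitioner` class; the helper names `natural_key` and `partition` and the choice of `zlib.crc32` as the hash are assumptions made for the example.

```python
import zlib

def natural_key(composite_key: str) -> str:
    """Return the symbol part of a 'symbol-time' composite key."""
    return composite_key.split("-", 1)[0]

def partition(composite_key: str, num_reducers: int) -> int:
    # Hash only the symbol, so every record for one symbol gets the same
    # reducer index. crc32 is used because Python's built-in hash() of a
    # str can differ between processes.
    return zlib.crc32(natural_key(composite_key).encode()) % num_reducers

keys = ["a-1", "a-2", "b-3"]
assignments = {key: partition(key, 2) for key in keys}
print(assignments)
```

The only guarantee is that keys sharing a symbol share a reducer; whether two different symbols land on different reducers depends on the hash and on the number of reducers.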

GroupComparator: without it, once the composite key\value pairs arrive at the reducer, the reducer would get

>(**a-1**,{1-10})
>
>(**a-2**,{2-20})

as two separate reduce calls, because the composite keys are all distinct.

The group comparator ensures that the reducer instead gets, in a single reduce method call:

(a-1,{**1-10,2-20**})

The key of the grouped values will be the one that comes first in the group. This can be controlled by the key (sort) comparator.
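The effect on the example data can be simulated in a few lines. This is a minimal Python sketch under stated assumptions, not Hadoop's API: all three pairs are assumed to sit on one reducer, and the helpers `sort_key`, `symbol_of` and `reduce_calls` are hypothetical names introduced for illustration.

```python
from itertools import groupby

# Map output from the example: (composite key "symbol-time", value "time-price").
map_output = [("a-2", "2-20"), ("b-3", "3-30"), ("a-1", "1-10")]

def sort_key(composite_key):
    # Key (sort) comparator: order by symbol, then by time.
    symbol, time = composite_key.split("-")
    return (symbol, int(time))

def symbol_of(composite_key):
    # Grouping comparator: two keys are "equal" if their symbols match.
    return composite_key.split("-")[0]

def reduce_calls(pairs, group_key):
    """Sort pairs by composite key, then build one (key, values) call per group."""
    ordered = sorted(pairs, key=lambda kv: sort_key(kv[0]))
    calls = []
    for _, group in groupby(ordered, key=lambda kv: group_key(kv[0])):
        group = list(group)
        # The call is keyed by the first composite key in the group.
        calls.append((group[0][0], [value for _, value in group]))
    return calls

without_grouping = reduce_calls(map_output, group_key=lambda k: k)
with_grouping = reduce_calls(map_output, group_key=symbol_of)
print(without_grouping)
print(with_grouping)
```

Grouping on the full composite key yields one call per record, while grouping on the symbol yields a single call for `a` whose values arrive in the order fixed by the sort.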
jakstack answered Oct 05 '22 06:10
