
How do I decide when to use a map-side join or a reduce-side join while writing MapReduce code in Java?

jkalyanc asked Apr 19 '15 19:04


2 Answers

A map-side join performs the join before the data reaches the map function, and it places strict prerequisites on the input data. Both approaches have pros and cons: a map-side join is more efficient than a reduce-side join, but it requires strictly formatted input.

Prerequisites:

  • The data must be partitioned and sorted in a particular way.
  • Each input dataset must be divided into the same number of partitions.
  • All inputs must be sorted by the same key.
  • All records for a particular key must reside in the same partition.
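To see why these prerequisites matter, here is a minimal plain-Java sketch (no Hadoop dependency) of how one pair of matching partitions can be joined in a single pass once both are sorted by the same key. The class name, the method name, and the sample records are hypothetical and chosen only for illustration. It is not the code Hadoop runs internally.

```java
import java.util.ArrayList;
import java.util.List;

class SortedPartitionJoin {

    // Joins one pair of partitions. Each record is {key, value}. Both lists
    // must already be sorted by key (compared as strings here); that is the
    // prerequisite this sketch illustrates.
    static List<String> mergeJoin(List<String[]> left, List<String[]> right) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = left.get(i)[0].compareTo(right.get(j)[0]);
            if (cmp < 0) {
                i++; // left key has no partner yet, advance left
            } else if (cmp > 0) {
                j++; // right key has no partner yet, advance right
            } else {
                String key = left.get(i)[0];
                // Find the end of the run of equal keys on the right side.
                int jEnd = j;
                while (jEnd < right.size() && right.get(jEnd)[0].equals(key)) {
                    jEnd++;
                }
                // Pair every left record with this key against that run.
                while (i < left.size() && left.get(i)[0].equals(key)) {
                    for (int k = j; k < jEnd; k++) {
                        out.add(key + "," + left.get(i)[1] + "," + right.get(k)[1]);
                    }
                    i++;
                }
                j = jEnd;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample data: partition 0 of "users" and of "orders".
        List<String[]> users = List.of(
                new String[] {"1", "alice"},
                new String[] {"2", "bob"},
                new String[] {"4", "dave"});
        List<String[]> orders = List.of(
                new String[] {"1", "book"},
                new String[] {"1", "pen"},
                new String[] {"3", "lamp"},
                new String[] {"4", "desk"});
        for (String row : mergeJoin(users, orders)) {
            System.out.println(row);
        }
    }
}
```

Because each partition is read once and in order, nothing has to be shuffled across the network. If the inputs were not sorted, or if records for one key were spread over different partitions, this single pass would miss matches.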

A reduce-side join is also called a repartitioned join or a repartitioned sort-merge join, and it is the most commonly used join type. It has to go through the sort and shuffle phase, which incurs network overhead. A reduce-side join relies on a few terms worth knowing: data source, tag, and group key.

  • Data source refers to the input data files, probably exported from an RDBMS.
  • Tag is a label attached to every record with its source name, so that the record's source can be identified at any point in the map or reduce phase. The reducer needs it to tell which dataset a record came from.
  • Group key refers to the column used as the join key between the two data sources.

Because the join happens on the reduce side, the map phase has to prepare the data so that it can be joined in the reduce phase: each record is tagged with its source and emitted under its group key.
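The roles of the tag and the group key can be sketched in plain Java, with a sorted map standing in for the shuffle. This is an illustration under assumptions, not Hadoop API code: the class, the method names, and the tags "U" and "O" are hypothetical, and a real job would implement this with Mapper and Reducer classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ReduceSideJoinSketch {

    // "Map" step: tag the value with its source and emit it under the group key.
    static void map(String tag, String[] record, Map<String, List<String>> shuffle) {
        shuffle.computeIfAbsent(record[0], k -> new ArrayList<>())
               .add(tag + ":" + record[1]);
    }

    // "Reduce" step: all values for one key arrive together. The tag tells
    // which source each value came from, so the two sides can be paired.
    static List<String> reduce(String key, List<String> taggedValues) {
        List<String> users = new ArrayList<>();
        List<String> orders = new ArrayList<>();
        for (String value : taggedValues) {
            if (value.startsWith("U:")) {
                users.add(value.substring(2));
            } else {
                orders.add(value.substring(2));
            }
        }
        List<String> out = new ArrayList<>();
        for (String user : users) {
            for (String order : orders) {
                out.add(key + "," + user + "," + order);
            }
        }
        return out;
    }

    // Drives the simulated job; the TreeMap plays the role of sort and shuffle.
    static List<String> join(List<String[]> users, List<String[]> orders) {
        Map<String, List<String>> shuffle = new TreeMap<>();
        for (String[] record : users) {
            map("U", record, shuffle);
        }
        for (String[] record : orders) {
            map("O", record, shuffle);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : shuffle.entrySet()) {
            out.addAll(reduce(entry.getKey(), entry.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample data; note that neither input is sorted.
        List<String[]> users = List.of(
                new String[] {"2", "bob"},
                new String[] {"1", "alice"});
        List<String[]> orders = List.of(
                new String[] {"1", "book"},
                new String[] {"3", "lamp"},
                new String[] {"2", "pen"},
                new String[] {"1", "pen"});
        for (String row : join(users, orders)) {
            System.out.println(row);
        }
    }
}
```

The inputs need no special layout here, which is what makes the approach flexible. The cost is that every record from both sources passes through the shuffle.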

For more information check this link: http://hadoopinterviews.com/map-side-join-reduce-side-join/

chandu kavar answered Nov 14 '22 18:11


Use a map-side join if one of your tables fits in memory; this reduces the overhead of sorting and shuffling the data.
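A rough plain-Java sketch of that idea: the small table is loaded into a hash map once, and each record of the large table is then joined by a lookup as it streams through the map step. The class and method names are hypothetical, and the sketch assumes keys in the small table are unique. In a real Hadoop job the small table is typically shipped to every mapper through the distributed cache and loaded in the mapper's setup method.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class InMemoryMapSideJoin {

    // Build the in-memory lookup from the small dataset ({key, value} records).
    // Assumption: each key appears once in the small table.
    static Map<String, String> loadSmallTable(List<String[]> smallTable) {
        Map<String, String> lookup = new HashMap<>();
        for (String[] record : smallTable) {
            lookup.put(record[0], record[1]);
        }
        return lookup;
    }

    // Stream the large dataset and join each record with a hash lookup.
    // No sorting or shuffling of the large dataset is needed.
    static List<String> join(Map<String, String> lookup, List<String[]> largeTable) {
        List<String> out = new ArrayList<>();
        for (String[] record : largeTable) {
            String match = lookup.get(record[0]);
            if (match != null) {
                out.add(record[0] + "," + match + "," + record[1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample data: "users" is small, "orders" is large.
        List<String[]> users = List.of(
                new String[] {"1", "alice"},
                new String[] {"2", "bob"});
        List<String[]> orders = List.of(
                new String[] {"2", "pen"},
                new String[] {"1", "book"},
                new String[] {"3", "lamp"});
        Map<String, String> lookup = loadSmallTable(users);
        for (String row : join(lookup, orders)) {
            System.out.println(row);
        }
    }
}
```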

Reduce-side joins are simpler than map-side joins because the input datasets do not need to be structured in any particular way. But they are less efficient, since both datasets have to go through the MapReduce shuffle phase, in which records with the same key are brought together in the reducer.

Karthik answered Nov 14 '22 17:11