 

Join of two datasets in MapReduce/Hadoop

Does anyone know how to implement the natural join operation between two datasets in Hadoop?

More specifically, here is exactly what I need to do:

I have two sets of data:

  1. Point information, stored as (tile_number, point_id:point_info). This is a 1:n key-value relationship: for every tile_number, there may be several point_id:point_info values.

  2. Line information, stored as (tile_number, line_id:line_info). This is again a 1:m key-value relationship: for every tile_number, there may be more than one line_id:line_info value.

As you can see, the tile_numbers are the same in the two datasets. What I need is to join these two datasets on tile_number. In other words, for every tile_number we have n point_id:point_info values and m line_id:line_info values, and I want to pair every point_id:point_info with every line_id:line_info for that tile_number.


In order to clarify, here's an example:

For point pairs:

(tile0, point0)
(tile0, point1)
(tile1, point1)
(tile1, point2)

For line pairs:

(tile0, line0)
(tile0, line1)
(tile1, line2)
(tile1, line3)

What I want is the following:

For tile 0:

 (tile0, point0:line0)
 (tile0, point0:line1)
 (tile0, point1:line0)
 (tile0, point1:line1)

For tile 1:

 (tile1, point1:line2)
 (tile1, point1:line3)
 (tile1, point2:line2)
 (tile1, point2:line3)

reza asked Aug 03 '12 21:08


1 Answer

Use a mapper that outputs tiles as keys and points/lines as values. You have to differentiate between the point output values and the line output values, for instance by prefixing each value with a special character (although a binary tag would be much better).

So the map output will be something like:

 tile0, _point0
 tile1, _point0
 tile2, _point1 
 ...
 tileX, *lineL
 tileY, *lineK
 ...

Then, at the reducer, your input will have this structure:

 tileX, [*lineK, ... , _pointP, ...., *lineM, ..., _pointR]

You will then have to separate the values into points and lines, compute their cross product, and output each resulting pair, like this:

tileX (lineK, pointP)
tileX (lineK, pointR)
...

If you can already easily differentiate between the point values and the line values (depending on your application's specifications), you don't need the special characters (*, _).
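
The tagging idea above can be sketched in plain Java, without any Hadoop dependency. This is only an illustration under stated assumptions: the class name TaggedMapSketch and the helpers tag and group are hypothetical, and group merely imitates what the framework's shuffle phase does between the mapper and the reducer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java imitation of the map and shuffle phases (hypothetical names, no Hadoop API).
class TaggedMapSketch {
    static final char POINT_TAG = '_'; // marks values coming from the point dataset
    static final char LINE_TAG = '*';  // marks values coming from the line dataset

    // "Map" step: prefix the value with a tag so the reducer can tell its source.
    static String tag(char sourceTag, String value) {
        return sourceTag + value;
    }

    // "Shuffle" step: collect all tagged values that share the same tile key.
    static Map<String, List<String>> group(List<String[]> mapOutput) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] keyValue : mapOutput) {
            grouped.computeIfAbsent(keyValue[0], k -> new ArrayList<>()).add(keyValue[1]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String[]> mapOutput = new ArrayList<>();
        mapOutput.add(new String[] {"tile0", tag(POINT_TAG, "point0")});
        mapOutput.add(new String[] {"tile0", tag(POINT_TAG, "point1")});
        mapOutput.add(new String[] {"tile0", tag(LINE_TAG, "line0")});
        mapOutput.add(new String[] {"tile0", tag(LINE_TAG, "line1")});
        // Each entry corresponds to one reducer call: a tile and its mixed tagged values.
        group(mapOutput).forEach((tile, values) -> System.out.println(tile + ", " + values));
    }
}
```

In a real job the order of values within a key is not guaranteed unless you configure a secondary sort, which is why the reducer has to inspect the tag of every value.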

Regarding the cross product you have to do in the reducer: first iterate through the entire list of values and separate them into two lists:

 List<String> points;
 List<String> lines;

Then compute the cross product using two nested for loops. Finally, iterate through the resulting list and, for each element, output:

tile(current key), element_of_the_resulting_cross_product_list
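
The reducer logic described above could look roughly like the following plain-Java sketch. It assumes the '_' and '*' prefixes from the map output example and the point:line output format from the question; the class CrossProductSketch and its reduce method are hypothetical stand-ins, not the Hadoop Reducer API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java stand-in for the reducer body (hypothetical names, no Hadoop API).
class CrossProductSketch {
    // Splits the tagged values of one tile into points and lines, then pairs them all.
    static List<String> reduce(String tile, Iterable<String> taggedValues) {
        List<String> points = new ArrayList<>();
        List<String> lines = new ArrayList<>();
        for (String value : taggedValues) {
            if (value.startsWith("_")) {
                points.add(value.substring(1)); // strip the point tag
            } else if (value.startsWith("*")) {
                lines.add(value.substring(1));  // strip the line tag
            }
        }
        // Cross product: every point of the tile paired with every line of the tile.
        List<String> output = new ArrayList<>();
        for (String point : points) {
            for (String line : lines) {
                output.add(tile + ", " + point + ":" + line);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("*line0", "_point0", "*line1", "_point1");
        for (String record : reduce("tile0", values)) {
            System.out.println(record);
        }
    }
}
```

Both lists are held in memory for each tile, so this approach assumes the points and lines of a single tile fit in the reducer's heap. In an actual reducer you would probably emit each pair inside the inner loop rather than collect an output list first.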

Razvan answered Sep 22 '22 09:09