Pig approach to pairing data fields in a data set

I'm new to Pig and trying to correctly implement a somewhat common algorithm in which I need to pair every matching record in a set of records. In order to distill the question into its simplest form and also avoid discussing some business-specific sensitivities, here's a mock problem:

Say that I have a dataset representing college classes and students that attend them:

Philosophy,John
English,Mary
English,Sue
History,Jack
Philosophy,David
English,Mark
English,Larry

I want to pair up all students who took the same class, so the output would look like this, with the four 'English' rows exploding into six associations:

Philosophy   John,David
English    Mary,Sue
English    Mary,Mark
English    Mary,Larry
English    Sue,Mark
English    Sue,Larry
English    Mark,Larry

This page: http://ofps.oreilly.com/titles/9781449302641/advanced_pig_latin.html refers to using flatten() to effect the cross product. I have tried several approaches and researched this extensively. I would post my attempts, but honestly I'm flailing, and I think that would just confuse the reader without adding any value. Here's the boilerplate:

s = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
grp = group s by class;
...

(I believe the problem I'm facing is that flatten requires multiple bags, not multiple fields, and I can't figure out how to get my grouping to generate multiple bags...)

Thank you for any assistance!

asked Dec 02 '12 14:12 by bethesdaboys


2 Answers

You can use the UnorderedPairs UDF from LinkedIn's DataFu project. Download the package and run the following (tested on Pig v0.10.0):

register '/home/user/datafu/dist/datafu-0.0.4.jar'
define UnorderedPairs datafu.pig.bags.UnorderedPairs();
A = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
B = GROUP A BY class;
C = FOREACH B GENERATE group, FLATTEN(UnorderedPairs(A.student));

Then flatten the result further:

D = FOREACH C generate FLATTEN($0) as (class:chararray), 
      FLATTEN($1) as (student1:chararray), FLATTEN($2) as (student2:chararray);

You'll end up with the desired result:

dump D;

(English,Mary,Sue)
(English,Mary,Mark)
(English,Mary,Larry)
(English,Sue,Mark)
(English,Sue,Larry)
(English,Mark,Larry)
(Philosophy,John,David)

answered Nov 03 '22 04:11 by Lorand Bendig


There are two approaches I see to this. I have not tried either in quite some time, so please follow up and let us know if they worked well or not.


The first approach is a self-join:

s1 = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
s2 = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
b = JOIN s1 BY class, s2 BY class;
...

The downside of this is that you have to load the data twice. There is some discussion on why this sucks, but it's just how you have to do it.
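
One way the self-join could be carried through to the pairs the question asks for is sketched below. This is only an illustration and has not been run here; it is not necessarily what was meant by the elided part above. A plain self-join returns every ordered pair within a class, including each student paired with themselves, so a filter on the student name keeps exactly one copy of each pair. The aliases c and d and the field names student1 and student2 are made up for this sketch, and it assumes student names are unique within a class.

```pig
-- Load the same file twice so the two sides of the join have distinct aliases.
s1 = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
s2 = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
b = JOIN s1 BY class, s2 BY class;
-- Keep only rows where the first name sorts before the second; this drops
-- self-pairs (Mary,Mary) and mirrored duplicates (Sue,Mary vs. Mary,Sue).
c = FILTER b BY s1::student < s2::student;
-- student1 and student2 are hypothetical output names, not from the original post.
d = FOREACH c GENERATE s1::class AS class, s1::student AS student1, s2::student AS student2;
```

With this filter, each pair comes out in alphabetical order (Larry,Mary rather than Mary,Larry), and a class with a single student, such as History, produces no rows.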


The other option would be to use CROSS nested in a FOREACH after the GROUP:

Note: I'm not sure at all if this will work, or if I got the syntax right (I'm not in an environment that I could test this right now). Perhaps someone can confirm.

B = GROUP s BY class;
C = FOREACH B {                          
   DA = CROSS s, s;                       
   GENERATE FLATTEN(DA);
}

answered Nov 03 '22 03:11 by Donald Miner