I'm new to Pig and trying to correctly implement a somewhat common algorithm in which I need to pair every matching record in a set of records. In order to distill the question into its simplest form and also avoid discussing some business-specific sensitivities, here's a mock problem:
Say that I have a dataset representing college classes and students that attend them:
Philosophy,John
English,Mary
English,Sue
History,Jack
Philosophy,David
English,Mark
English,Larry
I want to pair every association between students that took the same class; so the output would include this, showing the explosion of the four 'English' rows into six associations:
Philosophy John,David
English Mary,Sue
English Mary,Mark
English Mary,Larry
English Sue,Mark
English Sue,Larry
English Mark,Larry
This page: http://ofps.oreilly.com/titles/9781449302641/advanced_pig_latin.html refers to using flatten() to effect the cross product. I have tried several approaches and researched this extensively. I would post my attempts, but honestly I'm flailing, and I think they would only confuse the reader without adding any value. Here's the boilerplate:
s = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
grp = group s by class;
...
(I believe the problem I'm facing has to do with flatten requiring multiple bags, not multiple fields, and I can't figure out how to get my grouping to generate multiple bags...)
Thank you for any assistance!
You can use the UnorderedPairs UDF from LinkedIn's Datafu project. Download the package from here and run the following (tested on Pig v0.10.0):
register '/home/user/datafu/dist/datafu-0.0.4.jar'
define UnorderedPairs datafu.pig.bags.UnorderedPairs();
A = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
B = GROUP A BY class;
C = FOREACH B GENERATE group, FLATTEN(UnorderedPairs(A.student));
When further flattening the result:
D = FOREACH C generate FLATTEN($0) as (class:chararray),
FLATTEN($1) as (student1:chararray), FLATTEN($2) as (student2:chararray);
You'll end up having the desired result:
dump D;
(English,Mary,Sue)
(English,Mary,Mark)
(English,Mary,Larry)
(English,Sue,Mark)
(English,Sue,Larry)
(English,Mark,Larry)
(Philosophy,John,David)
There are two approaches I see to this. I have not tried either in quite some time, so please follow up and let us know if they worked well or not.
The first approach is a self join:
s1 = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
s2 = load 'classes' using PigStorage(',') as (class:chararray, student:chararray);
b = JOIN s1 BY class, s2 BY class;
...
The downside of this is that you have to load the data twice. There is some discussion on why this sucks, but it's just how you have to do it.
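As a rough sketch of how the join above might be continued (untested, so treat it as an assumption rather than a confirmed solution): a plain self join pairs every student with themselves and emits each pair twice, once in each order. Filtering on a string comparison of the two student fields should keep each unordered pair exactly once. The aliases c and d and the output field names student1 and student2 are made up for illustration.

```pig
s1 = LOAD 'classes' USING PigStorage(',') AS (class:chararray, student:chararray);
s2 = LOAD 'classes' USING PigStorage(',') AS (class:chararray, student:chararray);

-- every student paired with every student in the same class (including themselves)
b = JOIN s1 BY class, s2 BY class;

-- assumption: student names are unique within a class; the strict comparison
-- drops self-pairs and keeps only one ordering of each pair
c = FILTER b BY s1::student < s2::student;

-- hypothetical output field names
d = FOREACH c GENERATE s1::class AS class, s1::student AS student1, s2::student AS student2;

DUMP d;
```

Note that with this filter the two names in each pair come out in alphabetical order rather than in the order they appear in the input, and a class with only one student (such as History) produces no rows.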
The other option would be to use CROSS nested in a FOREACH after the GROUP:
Note: I'm not sure at all if this will work, or if I got the syntax right (I'm not in an environment where I can test this right now). Perhaps someone can confirm.
B = GROUP s BY class;
C = FOREACH B {
DA = CROSS s, s;
GENERATE FLATTEN(DA);
}