I am using Pig to group tuples as follows:
a1, b1
a1, b2
a1, b3
...
->
a1, [b1, b2, b3]
...
This is easy and it works. My problem is the next step: from each group, I would like to generate all pairs of elements in the group's bag:
a1, [b1, b2, b3]
->
b1,b2
b1,b3
b2,b3
This would be easy if I could nest "foreach", iterating first over each group and then over its bag.
I suppose I am misunderstanding the concept, and I would appreciate an explanation.
Thanks.
It looks like you need a Cartesian product of the bag with itself. To get it, use FLATTEN(bag) twice in the same generate.
Code:
inpt = load '.../group.txt' using PigStorage(',') as (id, val);
grp = group inpt by (id);
id_grp = foreach grp generate group as id, inpt.val as value_bag;
result = foreach id_grp generate id, FLATTEN(value_bag) as v1, FLATTEN(value_bag) as v2;
dump result;
Be aware that large bags will produce a lot of rows: a bag of n items yields n * n pairs. To limit this, you could cap the bag size with TOP(...) before FLATTEN:
inpt = load '....group.txt' using PigStorage(',') as (id, val);
grp = group inpt by (id);
id_grp = foreach grp generate group as id, inpt.val as values;
result = foreach id_grp {
    limited_bag = TOP(50, 0, values); -- all sorts of filtering could be done here
    generate id, FLATTEN(limited_bag) as v1, FLATTEN(limited_bag) as v2;
};
dump result;
For your specific output you could filter the bag before FLATTEN and then keep each pair only once:
inpt = load '..../group.txt' using PigStorage(',') as (id:chararray, val:chararray);
grp = group inpt by (id);
id_grp = foreach grp generate group as id, inpt.val as values;
result = foreach id_grp {
    l = filter values by val == 'b1' or val == 'b2';
    generate id, FLATTEN(l) as v1, FLATTEN(values) as v2;
};
-- v1 < v2 drops self-pairs and mirrored pairs; v1 != v2 would still keep (b2, b1)
result = filter result by v1 < v2;
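A more general variant, not tied to particular values, is to build the full product and keep only the rows where v1 < v2, so each unordered pair appears once and self-pairs are dropped. This is only a sketch: it assumes both fields are loaded as chararray so that the comparison is a string comparison, and the relation name pairs is introduced here for illustration.

```pig
-- Sketch: all unordered pairs per group, without hard-coded values.
-- Assumption: val is declared as chararray so that '<' compares strings.
inpt = load '.../group.txt' using PigStorage(',') as (id:chararray, val:chararray);
grp = group inpt by id;
id_grp = foreach grp generate group as id, inpt.val as value_bag;
-- Two FLATTENs in one generate give the cross product of the bag with itself.
pairs = foreach id_grp generate id, FLATTEN(value_bag) as v1, FLATTEN(value_bag) as v2;
-- Keep one ordering of each pair; this also removes (b1, b1) style rows.
result = filter pairs by v1 < v2;
dump result;
```

For the group a1 with values b1, b2 and b3, this should leave the pairs b1,b2, b1,b3 and b2,b3, each prefixed with the id. If a bag holds the same value more than once, those duplicates never pair with each other, so run distinct first if the input may repeat values.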
I hope it helps.
Cheers
Also relevant is the UnorderedPairs function from the DataFu UDF library. It generates all pairs of items in a bag (in your case, the grouped bag).
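A rough sketch of how that might look, assuming the DataFu jar is on hand and that the UDF lives at datafu.pig.bags.UnorderedPairs (check this against the DataFu version you use). The jar path is a placeholder, not a real location.

```pig
-- Assumption: placeholder path; point this at your actual DataFu jar.
register '/path/to/datafu.jar';
-- Assumption: class name as documented for DataFu's bag UDFs.
define UnorderedPairs datafu.pig.bags.UnorderedPairs();

inpt = load '.../group.txt' using PigStorage(',') as (id:chararray, val:chararray);
grp = group inpt by id;
-- UnorderedPairs takes a bag and returns a bag of pairs built from its tuples;
-- FLATTEN turns that bag into one output row per pair.
pairs = foreach grp generate group as id, FLATTEN(UnorderedPairs(inpt.val));
dump pairs;
```

Each element of a pair comes back as a tuple from the original bag, so you may need one more projection to get plain values.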