I would like to perform the equivalent of "keep all a in A where a.field == b.field for some b in B" in Apache Pig. I am implementing it like so:
AB_joined = JOIN A by field, B by field;
A2 = FOREACH AB_joined GENERATE A::field as field, A::field2 as field2, A::field3 as field3;
Enumerating all of A's entries is quite silly, and I would rather do something like:
A2 = FOREACH AB_joined GENERATE flatten(A);
However, this doesn't seem to work. Is there some other way I can do something equivalent without enumerating A's fields?
This should work:
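A2 = FOREACH AB_joined GENERATE $0.. ;
Note that $0.. projects every column from the first through the last, so after the JOIN it includes B's columns as well. To stop at A's last column, give the range an end, for example $0..$2 if A has the three fields shown in the question.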
You can use COGROUP to keep the columns of A separate from columns of B. This is especially useful when A's schema is dynamic and you don't want your code to fail when A's schema changes.
AB = COGROUP A BY field, B BY field;
-- schema of AB will be:
-- {group, A:{all fields of A}, B:{all fields of B}}
A2 = FOREACH AB GENERATE FLATTEN(A);
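One caveat: COGROUP also emits groups for keys that occur only in A (with an empty B bag), so flattening A directly would keep rows of A that have no match in B. A minimal sketch of the full "keep a if some b matches" version follows, assuming hypothetical input paths and schemas (a.tsv, b.tsv, and the column names in the LOAD statements are placeholders, not from the question):

```pig
-- Hypothetical inputs: file names and schemas are placeholders for illustration.
A = LOAD 'a.tsv' AS (field:chararray, field2:chararray, field3:chararray);
B = LOAD 'b.tsv' AS (field:chararray, other:chararray);

AB = COGROUP A BY field, B BY field;

-- Drop groups whose B bag is empty, i.e. keys that have no match in B.
AB_matched = FILTER AB BY NOT IsEmpty(B);

-- Flattening the A bag restores A's columns without naming them.
-- Groups with an empty A bag produce no rows when flattened.
A2 = FOREACH AB_matched GENERATE FLATTEN(A);

DUMP A2;
```

Unlike the JOIN version, this emits each row of A once no matter how many rows of B share its key, which is closer to the "for some b in B" wording of the question.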
Hope this helps.