I have a pig script that pertains to 2 Pig relations, lets say A and B. A is a small relationship, and B is a big one. My UDF should load all of A into memory on each machine and then use it while processing B. Currently I do it like this.
A = foreach smallRelation Generate ...
B = foreach largeRelation Generate propertyOfB;
store A into 'templocation';
C = foreach B Generate CustomUdf(propertyOfB);
I then have every machine load from 'templocation' to get A.This works, but I have two problems with it.
Does anyone know how it should be done?
Here's a trick that will work for you.
You do a GROUP ALL on A first which "bags" all data in A into one field. Then artificially add a common field on both A and B and join them. This way, foreach tuple in the enhanced B, you will have the full data of A for your UDF to use.
It's like this:
(say originally in A, you have fields fa1, fa2, fa3, in B you have fb1, fb2)
-- add an artificial join key with value 'xx'
B_aux = FOREACH B GENERATE 'xx' AS join_key, fb1, fb2;
A_all = GROUP A ALL;
A_aux = FOREACH A GENERATE 'xx' AS join_key, $1;
A_B_JOINED = JOIN B_aux BY join_key, A_aux BY join_key USING 'replicated';
C = FOREACH A_B_JOINED GENERATE CustomUdf(fb1, fb2, A_all);
since this is replicated join, it's also only map-side join.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With