
Pig, how to refer to a field after a join and a group by

I have this code in Pig (win, request and response are just tables loaded directly from the filesystem):

win_request = JOIN win BY bid_id, request BY bid_id;
win_request_response = JOIN win_request BY win.bid_id, response BY bid_id;

win_group = GROUP win_request_response BY (win.campaign_id);

win_count = FOREACH win_group GENERATE group, SUM(win.bid_price);

Basically I want to sum the bid_price after joining and grouping, but I get an error:

Could not infer the matching function for org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an explicit cast.

My guess is that I'm not referring correctly to win.bid_price.

Jorge González Lorenzo, asked Oct 30 '12 18:10





1 Answer

When performing multiple joins, I recommend giving your fields unique names in each relation (e.g. for bid_id), so that no field name is ambiguous after the join. Alternatively, you can use the disambiguation operator '::', but that can get pretty messy. With unique field names:

wins = LOAD '/user/hadoop/rtb/wins' USING PigStorage(',') AS (f1_w:int, f2_w:int, f3_w:chararray);
reqs = LOAD '/user/hadoop/rtb/reqs' USING PigStorage(',') AS (f1_r:int, f2_r:int, f3_r:chararray);
resps = LOAD '/user/hadoop/rtb/resps' USING PigStorage(',') AS (f1_rp:int, f2_rp:int, f3_rp:chararray);

wins_reqs = JOIN wins BY f1_w, reqs BY f1_r;
wins_reqs_reps = JOIN wins_reqs BY f1_r, resps BY f1_rp;

win_group = GROUP wins_reqs_reps BY (f3_w);

win_sum = FOREACH win_group GENERATE group, SUM(wins_reqs_reps.f2_w);
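
For comparison, here is a minimal sketch of the '::' alternative, applied to the relation names from the question. The LOAD paths and schemas are hypothetical, since the question does not show them. After the GROUP, the joined rows sit in a bag named after the grouped relation, so the field has to be projected out of win_request_response rather than out of win. That is probably why SUM could not be matched in the original script. Running DESCRIBE on the joined relation shows the exact prefixed field names, and it is worth checking them before relying on this.

```pig
-- Hypothetical paths and schemas, for illustration only.
win = LOAD 'win' USING PigStorage(',') AS (bid_id:chararray, campaign_id:chararray, bid_price:double);
request = LOAD 'request' USING PigStorage(',') AS (bid_id:chararray, site:chararray);
response = LOAD 'response' USING PigStorage(',') AS (bid_id:chararray, bid_price:double);

-- After a JOIN, fields that share a name are prefixed with their source relation.
win_request = JOIN win BY bid_id, request BY bid_id;
win_request_response = JOIN win_request BY win::bid_id, response BY bid_id;

-- Print the schema to confirm the prefixed names before using them.
DESCRIBE win_request_response;

win_group = GROUP win_request_response BY win::campaign_id;

-- Project the field out of the grouped bag, which is named after the grouped relation.
win_sum = FOREACH win_group GENERATE group AS campaign_id, SUM(win_request_response.win::bid_price) AS total_bid_price;
```

This keeps the original field names, but every reference after the first join needs its prefix, which is the messiness mentioned above.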
Frederic, answered Nov 27 '22 09:11
