I have data that's already grouped and aggregated, it looks like so:
user value count
---- -------- ------
Alice third 5
Alice first 11
Alice second 10
Alice fourth 2
...
Bob second 20
Bob third 18
Bob first 21
Bob fourth 8
...
For every user (Alice and Bob), I want retrieve their top n values (let's say 2), sorted terms of 'count'. So the desired output I want is this:
Alice first 11
Alice second 10
Bob first 21
Bob second 20
How can I accomplish that?
One approach is
records = LOAD '/user/nubes/ncdc/micro-tab/top.txt' AS (user:chararray,value:chararray,counter:int);
grpd = GROUP records BY user;
top3 = foreach grpd {
sorted = order records by counter desc;
top = limit sorted 2;
generate group, flatten(top);
};
Input is:
Alice third 5
Alice first 11
Alice second 10
Alice fourth 2
Bob second 20
Bob third 18
Bob first 21
Bob fourth 8
Output is:
(Alice,Alice,first,11)
(Alice,Alice,second,10
(Bob,Bob,first,21)
(Bob,Bob,second,20)
I have just made an observation that
top = limit sorted 2;
top is an inbuilt function and may throw an error so the only thing which I did was changed the name of the relation in this case and instead of
generate group, flatten(top);
which was giving the output
(Alice,Alice,first,11)
(Alice,Alice,second,10
(Bob,Bob,first,21)
(Bob,Bob,second,20)
Amended that as shown below -
records = load 'test1.txt' using PigStorage(',') as (user:chararray, value:chararray, count:int);
grpd = GROUP records BY user;
top2 = foreach grpd {
sorted = order records by count desc;
top1 = limit sorted 2;
generate flatten(top1);
};
which gave me the desired output as required by you -
(Alice,first,11)
(Alice,second,10)
(Bob,first,21)
(Bob,second,20)
Hope this helps.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With