grunt> dump jn;
(k1,k4,10)
(k1,k5,15)
(k2,k4,9)
(k3,k4,16)
grunt> jn = group jn by $1;
grunt> dump jn;
(k4,{(k1,k4,10),(k2,k4,9),(k3,k4,16)})
(k5,{(k1,k5,15)})
Now, from here I want the following output:
(k4,{(k3,k4,16),(k1,k4,10)})
(k5,{(k1,k5,15)})
Basically, I want to sort each group on the numbers (10, 9, 16) in descending order and select the top 2 for every row.
How do I do it?
The ORDER BY operator is used to sort the contents of a relation on one or more fields. Suppose you have a .txt file and have loaded it into Pig with LOAD. After that, you can sort the contents of that file on any field you want.
The GROUP operator is used to group the data in one or more relations. It collects the data having the same key.
This is similar to this question and you could use a Nested FOREACH, e.g.:
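-- Field names are placeholders; declaring the third field as int makes ORDER compare numerically.
A = LOAD 'data' AS (f1:chararray, f2:chararray, n:int);
jn = GROUP A BY $1;
B = FOREACH jn {
    -- Sort each group's bag by the number, largest first, then keep the first two tuples.
    sorted = ORDER A BY $2 DESC;
    lim = LIMIT sorted 2;
    -- Emit the group key with the bag so the rows look like (k4,{...}).
    GENERATE group, lim;
};
DUMP B;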
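As an alternative sketch, Pig's built-in TOP function can pick the N tuples with the largest value in a given column of each group's bag, without an explicit ORDER and LIMIT. The schema and field names below (f1, f2, n) are assumptions made for illustration, and the input is assumed to be tab-delimited. Note that TOP selects the right tuples but may not return them in sorted order inside the bag.

```pig
-- Assumed schema: the field names are placeholders, not taken from the question.
A = LOAD 'data' AS (f1:chararray, f2:chararray, n:int);
jn = GROUP A BY f2;
-- TOP(topN, column, bag): keep the 2 tuples with the largest value in column 2 (0-based).
B = FOREACH jn GENERATE group, TOP(2, 2, A);
DUMP B;
```

If the order of tuples inside each bag matters, the nested FOREACH with ORDER and LIMIT is probably the safer choice.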