I'm using PigLatin to filter some records.
User1 8 NYC
User1 9 NYC
User1 7 LA
User2 4 NYC
User2 3 DC
The script should remove the duplicate for users, and keep one of these records. Something like the unique command in linux.
The output should be:
User1 8 NYC
User2 4 NYC
Any suggestions?
For your particular example distinct will not work well as your output contains all of the input columns ($0, $1, $2)
, you can do distinct only on a projection that has columns ($0, $2)
or ($0)
and lose $1
.
In order to select one record per user (any record) you could use a GROUP BY
and a nested FOREACH
with LIMIT
. Ex:
inpt = load '......' ......;
user_grp = GROUP inpt BY $0;
filtered = FOREACH user_grp {
top_rec = LIMIT inpt 1;
GENERATE FLATTEN(top_rec);
};
This approach will help you get records that are unique on a subset of fields and also limit number of output records per each user, which you can control.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With