 

Removing duplicates using PigLatin

Tags:

apache-pig

I'm using PigLatin to filter some records.

User1  8 NYC 
User1  9 NYC 
User1  7 LA 
User2  4 NYC
User2  3 DC 

The script should remove the duplicate records for each user and keep just one of them, similar to the uniq command in Linux.

The output should be:

User1 8 NYC 
User2 4 NYC

Any suggestions?

aalsum asked Jul 18 '12 03:07


1 Answer

For your particular example, DISTINCT will not work well, because your output contains all of the input columns ($0, $1, $2) and every full row in the input is already distinct. You can apply DISTINCT only to a projection of the columns, such as ($0, $2) or ($0), and that loses $1.
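To make the limitation concrete, here is a minimal sketch of DISTINCT on a projection. The file name input.tsv, the tab delimiter, and the field names user, val, and city are assumptions for illustration; they are not from the question.

```pig
-- Hypothetical path, delimiter, and schema; adjust to the real input.
inpt = LOAD 'input.tsv' USING PigStorage('\t')
       AS (user:chararray, val:int, city:chararray);

-- Project only the user column ($0), then de-duplicate.
users_only = FOREACH inpt GENERATE user;
uniq_users = DISTINCT users_only;

-- Project user and city ($0, $2), then de-duplicate.
user_city      = FOREACH inpt GENERATE user, city;
uniq_user_city = DISTINCT user_city;

DUMP uniq_users;
DUMP uniq_user_city;
```

With the sample data, uniq_users holds one tuple per user, while uniq_user_city still holds two tuples for each user because the cities differ. In both cases the val column is gone, which is why DISTINCT alone cannot produce the requested output.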

To select one record per user (any record), you could use GROUP BY and a nested FOREACH with LIMIT. For example:

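-- Load the input (path and load options are elided here).
inpt = LOAD '......' ......;
-- Group all records by the user column ($0).
user_grp = GROUP inpt BY $0;
-- For each user, keep a single record from that user's bag.
filtered = FOREACH user_grp {
      top_rec = LIMIT inpt 1;
      GENERATE FLATTEN(top_rec);
};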

This approach gives you records that are unique on a subset of fields (here, the user), and the LIMIT value lets you control how many records are kept per user.
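LIMIT on its own does not promise which record of each group survives. If a predictable choice matters, one option is to sort inside the nested block before limiting. The sketch below keeps the record with the highest second column per user; the path, delimiter, and field names are again assumptions, and the "highest value wins" rule is only an illustration, not something the question asked for.

```pig
-- Hypothetical path, delimiter, and schema; adjust to the real input.
inpt = LOAD 'input.tsv' USING PigStorage('\t')
       AS (user:chararray, val:int, city:chararray);

user_grp = GROUP inpt BY user;

top_per_user = FOREACH user_grp {
    -- Sort this user's records so the choice is deterministic.
    sorted  = ORDER inpt BY val DESC;
    -- Keep only the first record after sorting.
    top_rec = LIMIT sorted 1;
    GENERATE FLATTEN(top_rec);
};

DUMP top_per_user;
```

With the sample data this keeps (User1, 9, NYC) and (User2, 4, NYC). Changing the ORDER clause, or the number passed to LIMIT, changes which records and how many of them are kept for each user.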

alexeipab answered Oct 02 '22 14:10