
Can I do DISTINCT on multiple columns in Pig?

Tags:

apache-pig

I have a use case in which I need to count the number of distinct combinations of two fields.

Sample:

x = LOAD 'testdata' using PigStorage('^A') as (a,b,c,d);

y = GROUP x BY a;

z = FOREACH y {

        bc = DISTINCT x.b,x.c;  -- the line in question: DISTINCT over two fields
        dd = DISTINCT x.d;
        GENERATE FLATTEN(group) as (a), COUNT(bc), COUNT(dd);
};

Pracheer Agarwal asked Jan 15 '23 01:01


1 Answer

You were quite close. The key is to not apply DISTINCT to two fields, but instead to apply it to a single composite field that you create:

x = LOAD 'testdata' using PigStorage('^A') as (a,b,c,d);
x2 = FOREACH x GENERATE a, TOTUPLE(b,c) AS bc, d;
y = GROUP x2 BY a;
z = FOREACH y {
        bc = DISTINCT x2.bc;
        dd = DISTINCT x2.d;
        GENERATE FLATTEN(group) AS (a), COUNT(bc), COUNT(dd);
};
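
To make the effect of the composite field concrete, here is a small worked run of the same pattern. The file name `sample.csv`, the comma delimiter, the declared types, the output aliases, and the rows themselves are made up for illustration. Only the TOTUPLE/DISTINCT approach comes from the script above, so treat this as a sketch rather than a tested job.

```pig
-- Hypothetical input file sample.csv (comma-delimited), columns a,b,c,d:
--   1,x,p,10
--   1,x,p,20
--   1,x,q,10
--   1,y,q,20
--   2,y,p,30
x  = LOAD 'sample.csv' USING PigStorage(',')
         AS (a:int, b:chararray, c:chararray, d:int);

-- Pack b and c into one tuple-valued field so DISTINCT treats each pair as a single value.
x2 = FOREACH x GENERATE a, TOTUPLE(b, c) AS bc, d;

y  = GROUP x2 BY a;

z  = FOREACH y {
        bc = DISTINCT x2.bc;   -- a = 1: pairs (x,p), (x,q), (y,q); a = 2: pair (y,p)
        dd = DISTINCT x2.d;    -- a = 1: values 10, 20;             a = 2: value 30
        GENERATE group AS a, COUNT(bc) AS distinct_bc, COUNT(dd) AS distinct_d;
};

DUMP z;
```

For these rows the result should contain (1,3,2) and (2,1,1). Note that for a = 1 there are only two distinct values of b and two of c, but three distinct (b,c) pairs, which is why the pair has to be counted as one field rather than as two separate DISTINCTs.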

reo katoa answered Apr 06 '23 04:04