I need help with this pig script. I am just getting a single record. I am selecting 2 columns and doing a count(distinct) on another while also using a where like clause to find a particular description (desc).
Here's my sql with pig I am trying to code.
/*
For example in sql:
select domain, count(distinct(segment)) as segment_cnt
from table
where desc='ABC123'
group by domain
order by segment_count desc;
*/
A = LOAD 'myoutputfile' USING PigStorage('\u0005')
AS (
domain:chararray,
segment:chararray,
desc:chararray
);
B = filter A by (desc=='ABC123');
C = foreach B generate domain, segment;
D = DISTINCT C;
E = group D all;
F = foreach E generate group, COUNT(D) as segment_cnt;
G = order F by segment_cnt DESC;
It is simple to perform a DISTINCT operation on all of the columns: A = LOAD 'data' AS (a1,a2,a3,a4); A_unique = DISTINCT A; Lets say that I am interested in performing the distinct across a1, a2, and a3.
To get the global count value (total number of tuples in a bag), we need to perform a Group All operation, and calculate the count value using the COUNT() function. To get the count value of a group (Number of tuples in a group), we need to group it using the Group By operator and proceed with the count function.
The correct syntax for using COUNT(DISTINCT) is: SELECT COUNT(DISTINCT Column1) FROM Table; The distinct count will be based off the column in parenthesis. The result set should only be one row, an integer/number of the column you're counting distinct values of.
You could GROUP on each domain and then count the number of distinct elements in each group with a nested FOREACH syntax:
D = group C by domain;
E = foreach D {
unique_segments = DISTINCT C.segment;
generate group, COUNT(unique_segments) as segment_cnt;
};
You can better define this as a macro:
DEFINE DISTINCT_COUNT(A, c) RETURNS dist {
temp = FOREACH $A GENERATE $c;
dist = DISTINCT temp;
groupAll = GROUP dist ALL;
$dist = FOREACH groupAll GENERATE COUNT(dist);
}
Usage:
X = LOAD 'data' AS (x: int);
Y = DISTINCT_COUNT(X, x);
If you need to use it in a FOREACH
instead then the easiest way is something like:
...GENERATE COUNT(Distinct(x))...
Tested on Pig 12.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With