I am learning how to use Hadoop Pig.
Suppose I have an input file like this:
a,b,c,true
s,c,v,false
a,s,b,true
...
The last field is the one I need to count: I want to know how many 'true' and 'false' values are in this file.
I tried:
records = LOAD 'test/input.csv' USING PigStorage(',');
boolean = foreach records generate $3;
groups = group boolean all;
Now I am stuck. I want to use:
count = foreach groups generate count('true');
to get the number of 'true' values, but I always get this error:
2013-08-07 16:32:36,677 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve count using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.] Details at logfile: /etc/pig/pig_1375911119028.log
Can anybody tell me where the problem is?
Two things. Firstly, count should actually be COUNT. In Pig, function names are case sensitive, and built-in functions such as COUNT are written in all caps.
Secondly, COUNT counts the number of tuples in a bag, not the occurrences of a particular value. Therefore, you should group by true/false, then COUNT:
boolean = FOREACH records GENERATE $3 AS trueORfalse ;
groups = GROUP boolean BY trueORfalse ;
counts = FOREACH groups GENERATE group AS trueORfalse, COUNT(boolean) ;
So now the output of a DUMP for counts will look something like:
(true, 2)
(false, 1)
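For reference, the whole group-then-count approach can be put together as one script. This is only a sketch: it assumes the input path from the question and declares a schema with made-up field names (f1, f2, f3, flag) so the last column can be referenced by name instead of $3. The alias flags is used only to avoid confusion with Pig's boolean type name.

```pig
-- Load the CSV with an explicit (assumed) schema; field names are illustrative.
records = LOAD 'test/input.csv' USING PigStorage(',')
          AS (f1:chararray, f2:chararray, f3:chararray, flag:chararray);

-- Keep only the column to be counted.
flags = FOREACH records GENERATE flag;

-- One group (and therefore one bag) per distinct value of flag.
grouped = GROUP flags BY flag;

-- COUNT the tuples in each group's bag.
counts = FOREACH grouped GENERATE group AS flag, COUNT(flags) AS n;

DUMP counts;
```

Note that COUNT skips tuples whose first field is null. If rows with a missing last field should be counted as well, COUNT_STAR can be used instead.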
If you want the counts of true and false in their own relations then you can FILTER the output of counts. However, it would probably be better to SPLIT boolean, then do two separate counts:
boolean = FOREACH records GENERATE $3 AS trueORfalse ;
SPLIT boolean INTO alltrue IF trueORfalse == 'true',
allfalse IF trueORfalse == 'false' ;
tcount = FOREACH (GROUP alltrue ALL) GENERATE COUNT(alltrue) ;
fcount = FOREACH (GROUP allfalse ALL) GENERATE COUNT(allfalse) ;
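The FILTER alternative mentioned above might look like the following. It is a minimal sketch that assumes the counts relation from the grouped example, with its trueORfalse field. The aliases true_count and false_count are made up here.

```pig
-- Pick each value's row out of the already-aggregated counts relation.
true_count  = FILTER counts BY trueORfalse == 'true';
false_count = FILTER counts BY trueORfalse == 'false';

DUMP true_count;
DUMP false_count;
```

With this variant each resulting relation still carries the trueORfalse field next to the count, whereas tcount and fcount from the SPLIT version contain the count alone.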