
Hadoop Pig count number

I am learning how to use Hadoop Pig now.

If I have an input file like this:

a,b,c,true
s,c,v,false
a,s,b,true
...

The last field is the one I need to count, so I want to know how many 'true' and 'false' values there are in this file.

I tried:

records = LOAD 'test/input.csv' USING PigStorage(',');
boolean = foreach records generate $3;
groups = group boolean all;

Now I am stuck. I want to use:

count = foreach groups generate count('true');

to get the number of "true" values, but I always get this error:

2013-08-07 16:32:36,677 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve count using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.] Details at logfile: /etc/pig/pig_1375911119028.log

Can anybody tell me where the problem is?

user2597504 asked Aug 07 '13 22:08


1 Answer

Two things. Firstly, count should actually be COUNT. Function names in Pig are case-sensitive, and the built-in counting function is named COUNT in all caps.

Secondly, COUNT counts the number of tuples in a bag, not the occurrences of a particular value. Therefore, you should group by true/false, then COUNT:

boolean = FOREACH records GENERATE $3 AS trueORfalse ;
groups = GROUP boolean BY trueORfalse ;
counts = FOREACH groups GENERATE group AS trueORfalse, COUNT(boolean) ;

So now the output of a DUMP for counts will look something like:

(true, 2)
(false, 1)
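
For reference, here is the same idea as one end-to-end script. Treat it as a rough sketch: the schema in the AS clause (f1, f2, f3, flag) and the aliases by_flag and n are names I made up for illustration, and the path is the one from the question.

```pig
-- Assumed schema: f1, f2, f3 and flag are hypothetical names for the four columns.
records = LOAD 'test/input.csv' USING PigStorage(',')
          AS (f1:chararray, f2:chararray, f3:chararray, flag:chararray);

-- Group the rows by the value of the last field ('true' or 'false').
by_flag = GROUP records BY flag;

-- Each group carries a bag named after the grouped relation; COUNT sizes that bag.
-- Note: COUNT skips tuples whose first field is null; COUNT_STAR counts every tuple.
counts = FOREACH by_flag GENERATE group AS flag, COUNT(records) AS n;

DUMP counts;
```

Giving the fields names up front means you can refer to flag instead of $3 in the later statements.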

If you want the counts of true and false in their own relations then you can FILTER the output of counts. However, it would probably be better to SPLIT boolean, then do two separate counts:

boolean = FOREACH records GENERATE $3 AS trueORfalse ;
SPLIT boolean INTO alltrue IF trueORfalse == 'true', 
                   allfalse IF trueORfalse == 'false' ;

tcount = FOREACH (GROUP alltrue ALL) GENERATE COUNT(alltrue) ;
fcount = FOREACH (GROUP allfalse ALL) GENERATE COUNT(allfalse) ;
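
The FILTER route mentioned above could look roughly like this. It is only a sketch: the aliases flags, grouped, true_count and false_count and the field names flag and n are hypothetical, and the cast to chararray is my own addition so that the comparison against a string literal is explicit.

```pig
records = LOAD 'test/input.csv' USING PigStorage(',');

-- Keep only the last column; the cast and the name flag are assumptions of this sketch.
flags = FOREACH records GENERATE (chararray)$3 AS flag;

-- One row per distinct value, with the number of tuples in each group.
grouped = GROUP flags BY flag;
counts  = FOREACH grouped GENERATE group AS flag, COUNT(flags) AS n;

-- Pull each count into its own relation.
true_count  = FILTER counts BY flag == 'true';
false_count = FILTER counts BY flag == 'false';

DUMP true_count;
DUMP false_count;
```

Here the grouping runs once and the two FILTER statements only touch the small counts relation, whereas the SPLIT version does a separate GROUP ALL for each branch.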
mr2ert answered Sep 27 '22 02:09