Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

PIG how to count a number of rows in alias

People also ask

How do you count the number of records in a pig?

I did something like this to count the number of rows in an alias in PIG: logs = LOAD 'log' logs_w_one = foreach logs generate 1 as one; logs_group = group logs_w_one all; logs_count = foreach logs_group generate SUM(logs_w_one.

How count function works in pig?

The COUNT() function of Pig Latin is used to get the number of elements in a bag. While counting the number of tuples in a bag, the COUNT() function ignores (will not count) the tuples having a NULL value in the FIRST FIELD.

How do you SUM a column in a pig?

You can use the SUM() function of Pig Latin to get the total of the numeric values of a column in a single-column bag. While computing the total, the SUM() function ignores the NULL values. To get the global sum value, we need to perform a Group All operation, and calculate the sum value using the SUM() function.


COUNT is part of pig see the manual

LOGS= LOAD 'log';
LOGS_GROUP= GROUP LOGS ALL;
LOG_COUNT = FOREACH LOGS_GROUP GENERATE COUNT(LOGS);

Arnon Rotem-Gal-Oz already answered this question a while ago, but I thought some may like this slightly more concise version.

LOGS = LOAD 'log';
LOG_COUNT = FOREACH (GROUP LOGS ALL) GENERATE COUNT(LOGS);

Be careful, with COUNT your first item in the bag must not be null. Else you can use the function COUNT_STAR to count all rows.


Basic counting is done as was stated in other answers, and in the pig documentation:

logs = LOAD 'log';
all_logs_in_a_bag = GROUP logs ALL;
log_count = FOREACH all_logs_in_a_bag GENERATE COUNT(logs);
dump log_count

You are right that counting is inefficient, even when using pig's builtin COUNT because this will use one reducer. However, I had a revelation today that one of the ways to speed it up would be to reduce the RAM utilization of the relation we're counting.

In other words, when counting a relation, we don't actually care about the data itself so let's use as little RAM as possible. You were on the right track with your first iteration of the count script.

logs = LOAD 'log'
ones = FOREACH logs GENERATE 1 AS one:int;
counter_group = GROUP ones ALL;
log_count = FOREACH counter_group GENERATE COUNT(ones);
dump log_count

This will work on much larger relations than the previous script and should be much faster. The main difference between this and your original script is that we don't need to sum anything.

This also doesn't have the same problem as other solutions where null values would impact the count. This will count all the rows, regardless of if the first column is null or not.


USE COUNT_STAR

LOGS= LOAD 'log';
LOGS_GROUP= GROUP LOGS ALL;
LOG_COUNT = FOREACH LOGS_GROUP GENERATE COUNT_STAR(LOGS);

Here is a version with optimization. All the solutions above would require pig to read and write full tuple when counting, this script below just write '1'-s

DEFINE row_count(inBag, name) RETURNS result {
    X = FOREACH $inBag generate 1;
    $result = FOREACH (GROUP X ALL PARALLEL 1) GENERATE '$name', COUNT(X);
};

The use it like

xxx = row_count(rows, 'rows_count');