I have data like this:
2013-11 localhost kern
2013-11 localhost kern
2013-11 192.168.0.59 daemon
2013-12 localhost kern
2013-12 localhost daemon
2013-12 localhost mail
You get the idea. I'm trying to group the above by date (as the row key) and have one column per facility (kern, daemon, etc.) holding its count. In short, my desired output is:
-- date, count(kern), count(daemon), count(mail)
(2013-11, 2, 1, 0)
(2013-12, 1, 1, 1)
Currently, my approach looks like this:
valid_records = FILTER formatted_records BY date is not null;
date_group = GROUP valid_records BY date;
facilities = FOREACH date_group {
-- DUMB way to filter for now :(
kern = FILTER valid_records BY facility == 'kern';
daemon = FILTER valid_records BY facility == 'daemon';
mail = FILTER valid_records BY facility == 'mail';
-- these need to be in order so they get mapped correctly to HBase
GENERATE group, COUNT(kern), COUNT(daemon), COUNT(mail);
}
Two questions:
I have 3 filters above, but in production there could easily be 10 or more. Is there a performance hit from using this many FILTER operations inside a nested FOREACH?
Is there a better way to do this?
I think the problem is that you are looking for an output with a variable schema: one count column per facility, where the set of facilities depends on the data. But it seems all you need is to group by a composite key. With this script:
formatted_records = LOAD 'input' AS (date: chararray, host: chararray, facility: chararray);
valid_records = FILTER formatted_records BY date is not null;
counts = FOREACH (GROUP valid_records BY (date, facility)) GENERATE
group.date AS date,
group.facility AS facility,
COUNT(valid_records) AS the_count;
DUMP counts;
You will get:
(2013-11,kern,2)
(2013-11,daemon,1)
(2013-12,kern,1)
(2013-12,mail,1)
(2013-12,daemon,1)
which contains exactly the same information.
If you want to reshape the output into a wide format like yours, it is better to do that separately in a general-purpose language such as Java or Python (assuming Pig's output is small enough to fit in memory); Pig is not good at pivoting.
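As a sketch of that post-processing step, here is a small Python script that pivots the (date, facility, count) rows into one tuple per date. It assumes the counts were written with STORE using the default PigStorage (tab-delimited fields), and the `FACILITIES` list is a hypothetical column order you would adapt to your HBase layout:

```python
from collections import defaultdict

# Hypothetical column order; extend to match your HBase column layout.
FACILITIES = ["kern", "daemon", "mail"]

def pivot(lines):
    """Pivot tab-delimited (date, facility, count) rows into one tuple per date."""
    # Each date starts with a zero count for every known facility.
    table = defaultdict(lambda: dict.fromkeys(FACILITIES, 0))
    for line in lines:
        date, facility, count = line.rstrip("\n").split("\t")
        if facility in table[date]:          # ignore facilities not in the layout
            table[date][facility] = int(count)
    return [(date, *(table[date][f] for f in FACILITIES))
            for date in sorted(table)]

# Sample rows matching the Pig output above.
rows = [
    "2013-11\tkern\t2",
    "2013-11\tdaemon\t1",
    "2013-12\tkern\t1",
    "2013-12\tmail\t1",
    "2013-12\tdaemon\t1",
]
for row in pivot(rows):
    print(row)
# ('2013-11', 2, 1, 0)
# ('2013-12', 1, 1, 1)
```

This keeps the Pig job simple (one GROUP, no per-facility FILTERs) and confines the schema-dependent formatting to a script that is trivial to change when a new facility appears.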