Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pig 0.11.1 - Count groups in a time range

I have a dataset, A, that has timestamp, visitor, URL:

(2012-07-21T14:00:00.000Z, joe, hxxp:///www.aaa.com) 
(2012-07-21T14:01:00.000Z, mary, hxxp://www.bbb.com) 
(2012-07-21T14:02:00.000Z, joe, hxxp:///www.aaa.com) 

I want to measure number of visits per user per URL in a time window of say, 10 minutes, but as a rolling window that increments by the minute. Output would be:

(2012-07-21T14:00 to 2012-07-21T14:10, joe, hxxp://www.aaa.com, 2)
(2012-07-21T14:01 to 2012-07-21T14:11, joe, hxxp://www.aaa.com, 1)

To make the arithmetic easy, I change the timestamp to minute of the day, as:

(840, joe, hxxp://www.aaa.com) /* 840 = 14:00 hrs x 60 + 00 mins) */

To iterate over 'A' by a moving time window, I create a dataset B of minutes in the day:

(0)
(1)
(2)
.
.
.
.
(1440)

Ideally, I want to do something like:

A = load 'dataset1' AS (ts, visitor, uri)
B = load 'dataset2' as (minute)

foreach B {
C = filter A by ts > minute AND ts < minute + 10;
D = GROUP C BY (visitor, uri);
foreach D GENERATE group, count(C) as mycnt;
}

DUMP B;

I know "GROUP" isn't allowed inside a "FOREACH" loop but is there a workaround to achieve the same result?

Thanks!

like image 401
Joe Nate Avatar asked Aug 01 '13 20:08

Joe Nate


1 Answers

Maybe you can do something like this?

NOTE: This is dependent on the minutes you create for the logs being integers. If they are not then you can round to the nearest minute.

myudf.py

#!/usr/bin/python

@outputSchema('expanded: {(num:int)}')
def expand(start, end):
        return [ (x) for x in range(start, end) ]

myscript.pig

register 'myudf.py' using jython as myudf ;

-- A1 is the minutes. Schema:
-- A1: {minute: int}
-- A2 is the logs. Schema:
-- A2: {minute: int,name: chararray}
-- These schemas should change to fit your needs.

B = FOREACH A1 GENERATE minute, 
                        FLATTEN(myudf.expand(minute, minute+10)) AS matchto ;
-- B is in the form:
-- 1 1
-- 1 2
-- ....
-- 2 2
-- 2 3
-- ....
-- 100 100
-- 100 101
-- etc.

-- Now we join on the minute in the second column of B with the 
-- minute in the log, then it is just grouping by the minute in
-- the first column and name and counting
C = JOIN B BY matchto, A2 BY minute ;
D = FOREACH (GROUP C BY (B::minute, name)) 
            GENERATE FLATTEN(group), COUNT(C) as count ;

I'm a little worried about speed for larger logs, but it should work. Let me know if you need me to explain anything.

like image 76
mr2ert Avatar answered Oct 09 '22 05:10

mr2ert