Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Bigquery SQL for sliding window aggregate

Hi I have a table that looks like this

Date         Customer   Pageviews
2014/03/01   abc          5
2014/03/02   xyz          8
2014/03/03   abc          6

I want to get page view aggregates grouped by week but showing aggregates for past 30 days - (sliding window aggregates with window-size of 30 days for every week)

I am using google bigquery

EDIT: Gordon - re your comment about "Customer", Actually what I need is slightly more complicated thats why I included customer in the table above. I am looking to get the number of customers who had >n pageviews in a 30day window every week. something like this

Date        Customers>10 pageviews in 30day window
2014/02/01  10
2014/02/08  5
2014/02/15  6
2014/02/22  15

However to keep it simple, I will work my way if I could just get a sliding window aggregate of pageviews ignoring customers altogether. something like this

Date        count of pageviews in 30day window
2014/02/01  50
2014/02/08  55
2014/02/15  65
2014/02/22  75
like image 650
prat Avatar asked Mar 14 '14 21:03

prat


People also ask

Does BigQuery support window functions?

The BigQuery Window Functions allow you to query either a subset or a full set of rows from your dataset and return a value for every row of the results. They are normally used to calculate running totals, rolling averages, and removing duplicate rows from a dataset.

What is array AGG in BigQuery?

ARRAY_AGG. Returns an ARRAY of expression values. To learn more about the optional arguments in this function and how to use them, see Aggregate function calls. To learn more about the OVER clause and how to use it, see Window function calls.

Can BigQuery be used for transactional data?

BigQuery stores data using a columnar storage format that is optimized for analytical queries. BigQuery presents data in tables, rows, and columns and provides full support for database transaction semantics (ACID).

Is BigQuery faster than Postgres?

Better performance The main reason Google BigQuery is better than PostgreSQL is performance. Google BigQuery is 100% elastic, meaning that it allocates the necessary resources required on-demand to run your queries in seconds and is highly optimized for query performance.


2 Answers

How about this:

SELECT changes + changes1 + changes2 + changes3 changes28days, login, USEC_TO_TIMESTAMP(week)
FROM (
  SELECT changes,
         LAG(changes, 1) OVER (PARTITION BY login ORDER BY week) changes1,
         LAG(changes, 2) OVER (PARTITION BY login ORDER BY week) changes2,
         LAG(changes, 3) OVER (PARTITION BY login ORDER BY week) changes3,
         login,
         week
  FROM (
    SELECT SUM(payload_pull_request_changed_files) changes, 
           UTC_USEC_TO_WEEK(created_at, 1) week,
           actor_attributes_login login,
    FROM [publicdata:samples.github_timeline]
    WHERE payload_pull_request_changed_files > 0
    GROUP BY week, login
))
HAVING changes28days > 0

For each user it counts how many changes they have submitted per week. Then with LAG() we can peek into the next row, how many changes they submitted the -1, -2, and -3 week. Then we just add those 4 weeks to see how many changes were submitted on the last 28 days.

Now you can wrap everything in a new query to filter users with changes>X, and count them.

like image 155
Felipe Hoffa Avatar answered Sep 26 '22 07:09

Felipe Hoffa


I have created the following "Times" table:

Table Details: Dim_Periods
Schema
Date    TIMESTAMP   
Year    INTEGER         
Month   INTEGER         
day         INTEGER         
QUARTER INTEGER     
DAYOFWEEK   INTEGER     
MonthStart  TIMESTAMP   
MonthEnd    TIMESTAMP   
WeekStart   TIMESTAMP   
WeekEnd TIMESTAMP   
Back30Days  TIMESTAMP   -- the date 30 days before "Date"
Back7Days   TIMESTAMP   -- the date 7 days before "Date"

and I use such query to handle "running sums"

SELECT Date,Count(*) as MovingCNT
FROM

(SELECT Date,
                Back7Days 
                    FROM DWH.Dim_Periods  
                 where Date < timestamp(current_date()) AND
                             Date >= (DATE_ADD (CURRENT_TIMESTAMP(), -5, 'month'))
                )P
                CROSS JOIN EACH
    (SELECT repository_url,repository_created_at
    FROM publicdata:samples.github_timeline
                ) L
        WHERE timestamp(repository_created_at)>= Back7Days 
              AND timestamp(repository_created_at)<= Date

GROUP EACH BY Date

Note that it can be used for "Month to date", Week to Date" "30 days back" etc. aggregations as well. However, performance is not the best and the query can take a while on larger data sets due to the Cartesian join. Hope this helps

like image 38
N.N. Avatar answered Sep 23 '22 07:09

N.N.