Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I calculate moving sum / average on Google BigQuery?

Analyzing trends in data series with too much volatility is hard. In many cases it is useful to use smoothing techniques such as moving averages or moving sums. There are a lot of tool to do this type of operation but when we are talking about millions of rows it is useful to do it directly in a cloud environment such as Google Big Query.

My question is: How can I calculate moving sum/avg on Google Big Query?

Bellow it follows a figure of the moving average average I want to achieve: enter image description here

like image 931
fernandosjp Avatar asked Dec 08 '22 16:12

fernandosjp


2 Answers

Below is for BigQuery Standard SQL

#standardSQL
SELECT
    pickup_date,
    number_of_trip,
    AVG(number_of_trip) OVER (ORDER BY day RANGE BETWEEN 6 PRECEDING AND CURRENT ROW) AS mov_avg_7d,
    AVG(number_of_trip) OVER (ORDER BY day RANGE BETWEEN 27 PRECEDING AND CURRENT ROW) AS mov_avg_28d
FROM (
  SELECT 
    DATE(pickup_datetime) AS pickup_date,
    UNIX_DATE(DATE(pickup_datetime)) AS day,
    COUNT(*) AS number_of_trip
  FROM `nyc-tlc.yellow.trips` 
  GROUP BY 1, 2
)
WHERE pickup_date>'2013-01-01'

From first glance - this answer looks very similar to OP's answer so just few comments about how this answer is different :

First (and least important) - it is for BigQuery Standard SQL which is highly recommended by BigQuery Team to use - unless one has really good reason to use Legacy SQL - for example because of range snapshot or something very specific to legacy sql

Secondly, and most important - using OVER with ROWS in such context is not the best option because it counts rows and not the days, so if - by chance - any given day is missed - calculation will use last 8 and 29 days respectively (instead of 7 and 28)
In such cases one should use OVER with RANGE

like image 76
Mikhail Berlyant Avatar answered Dec 10 '22 05:12

Mikhail Berlyant


I spent a lot of time researching this answer without success so I thought it would be worth it to share it with more people.

Solution: To arrive at the answer I used Big Query's Analytic Functions OVER with ROWS (https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#analytic-function-syntax). Bellow there is an example of 7 day moving average and 28 day moving average of taxi trips using public data available in BigQuery:

SELECT
    pickup_date,
    number_of_trip,
    avg(number_of_trip) OVER (ORDER BY pickup_date ROWS BETWEEN 6 PRECEDING and CURRENT ROW) AS mov_avg_7d,
    avg(number_of_trip) OVER (ORDER BY pickup_date ROWS BETWEEN 27 PRECEDING and CURRENT ROW) AS mov_avg_28d
FROM
    (SELECT 
        date(pickup_datetime) as pickup_date,
        count(*) as number_of_trip,
    FROM [nyc-tlc:yellow.trips] 
    group each by 1
    order by 1)
where pickup_date>'2013-01-01'

Be careful with anti-patterns! there are many posts online that suggest solutions using JOIN or even CROSS JOIN to achieve the same result. However these methods are anti-patterns according to Big Query documentation (https://cloud.google.com/bigquery/docs/best-practices-performance-patterns). That means that for large amounts of data performance will be an issue if you solve the problem using brute force.

like image 24
fernandosjp Avatar answered Dec 10 '22 06:12

fernandosjp