Analyzing trends in data series with too much volatility is hard. In many cases it is useful to use smoothing techniques such as moving averages or moving sums. There are a lot of tool to do this type of operation but when we are talking about millions of rows it is useful to do it directly in a cloud environment such as Google Big Query.
My question is: How can I calculate moving sum/avg on Google Big Query?
Bellow it follows a figure of the moving average average I want to achieve:
Below is for BigQuery Standard SQL
#standardSQL
SELECT
pickup_date,
number_of_trip,
AVG(number_of_trip) OVER (ORDER BY day RANGE BETWEEN 6 PRECEDING AND CURRENT ROW) AS mov_avg_7d,
AVG(number_of_trip) OVER (ORDER BY day RANGE BETWEEN 27 PRECEDING AND CURRENT ROW) AS mov_avg_28d
FROM (
SELECT
DATE(pickup_datetime) AS pickup_date,
UNIX_DATE(DATE(pickup_datetime)) AS day,
COUNT(*) AS number_of_trip
FROM `nyc-tlc.yellow.trips`
GROUP BY 1, 2
)
WHERE pickup_date>'2013-01-01'
From first glance - this answer looks very similar to OP's answer so just few comments about how this answer is different :
First (and least important) - it is for BigQuery Standard SQL which is highly recommended by BigQuery Team to use - unless one has really good reason to use Legacy SQL - for example because of range snapshot or something very specific to legacy sql
Secondly, and most important - using OVER with ROWS in such context is not the best option because it counts rows and not the days, so if - by chance - any given day is missed - calculation will use last 8 and 29 days respectively (instead of 7 and 28)
In such cases one should use OVER with RANGE
I spent a lot of time researching this answer without success so I thought it would be worth it to share it with more people.
Solution: To arrive at the answer I used Big Query's Analytic Functions OVER
with ROWS
(https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#analytic-function-syntax). Bellow there is an example of 7 day moving average and 28 day moving average of taxi trips using public data available in BigQuery:
SELECT
pickup_date,
number_of_trip,
avg(number_of_trip) OVER (ORDER BY pickup_date ROWS BETWEEN 6 PRECEDING and CURRENT ROW) AS mov_avg_7d,
avg(number_of_trip) OVER (ORDER BY pickup_date ROWS BETWEEN 27 PRECEDING and CURRENT ROW) AS mov_avg_28d
FROM
(SELECT
date(pickup_datetime) as pickup_date,
count(*) as number_of_trip,
FROM [nyc-tlc:yellow.trips]
group each by 1
order by 1)
where pickup_date>'2013-01-01'
Be careful with anti-patterns! there are many posts online that suggest solutions using JOIN
or even CROSS JOIN
to achieve the same result. However these methods are anti-patterns according to Big Query documentation (https://cloud.google.com/bigquery/docs/best-practices-performance-patterns). That means that for large amounts of data performance will be an issue if you solve the problem using brute force.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With