We have been trying quite hard to loop over data in (standard sql) BigQuery to no success.
I am not sure if it is the supported functionality of sql, our undestanding of the problem or the way we want to do this as we want to do it within BigQuery.
Anyhow, let's say we have a table of events where each event is described by a user id and a date (there can be many events on the same date by the same user id)
id STRING
dt DATE
One thing we want to know is how many distinct users generated events within a given period of time. This is rather trivial, just a COUNT on the table with the period as constraint in the WHERE clause. For example, if we have four months as our period of time:
SELECT
COUNT(DISTINCT id) AS total
FROM
`events`
WHERE
dt BETWEEN DATE_ADD(CURRENT_DATE(), INTERVAL -4 MONTH)
AND CURRENT_DATE()
However, our issues come if we want the history as well for other days (or weeks) recursively with the same given period of time. For example, for yesterday, the day before yesterday, etc... till... for example, 3 months ago. So the variable here would be CURRENT_DATE() that goes back by one day or whichever factor but the interval remains the same (in our case, 4 months). We are expecting something like this (with a factor of one day):
2017-07-14 2017-03-14 1760333
2017-07-13 2017-03-13 1856333
2017-07-12 2017-03-12 2031993
...
2017-04-14 2017-01-14 1999352
This is just a loop over every day, week, etc on the same table, and then a COUNT on the distinct events happening within that period of time. But we can't do 'loops' in BigQuery.
One way we thought was a JOIN, and then a COUNT on the GROUP BY intervals (taking advantage of the HAVING clause to simulate the period from a given day back to 4 months), but this is very inefficient and it just doesn't ever finish considering table's size (which has around 254 million records, 173 GB as of today, and it just keeps growing every day).
Another way we thought was using UDFs with the idea that we feed a list of date intervals to the function and then we function would apply the naive query (for counting) for every interval returning the interval and the count for that interval. But... UDFs in BigQuery do not support accessing tables within the UDF so we would have to sort of feed the whole table to the UDF which we haven't tried but doesn't seem reasonable.
So, we have no solution in mind to basically iterate over the same data and do calculations on parts of the data (overlapping parts as you see) within BigQuery and our only solution is doing this outside BigQuery (the loop functionality in the end).
Is there a way or someone can think of a way to do this all within BigQuery? Our goal would be to provide this as a view inside BigQuery so that it doesn't depend on an external system that needs to be triggered at the frequency that we set up (days/weeks/etc...).
Below is example of this technique for BigQuery Standard SQL
#standardSQL
SELECT
DAY,
COUNT(CASE WHEN period = 7 THEN id END) AS days_07,
COUNT(CASE WHEN period = 14 THEN id END) AS days_14,
COUNT(CASE WHEN period = 30 THEN id END) AS days_30
FROM (
SELECT
dates.day AS DAY,
periods.period AS period,
id
FROM yourTable AS activity
CROSS JOIN (SELECT DAY FROM yourTable GROUP BY DAY) AS dates
CROSS JOIN (SELECT period FROM (SELECT 7 AS period UNION ALL
SELECT 14 AS period UNION ALL SELECT 30 AS period)) AS periods
WHERE dates.day >= activity.day
AND CAST(DATE_DIFF(dates.day, activity.day, DAY) / periods.period AS INT64) = 0
GROUP BY 1,2,3
)
GROUP BY DAY
-- ORDER BY DAY
You can play/test with this example using dummy data as below
#standardSQL
WITH data AS (
SELECT
DAY, CAST(10 * RAND() AS INT64) AS id
FROM UNNEST(GENERATE_DATE_ARRAY('2017-01-01', '2017-07-13')) AS DAY
)
SELECT
DAY,
COUNT(DISTINCT CASE WHEN period = 7 THEN id END) AS days_07,
COUNT(DISTINCT CASE WHEN period = 14 THEN id END) AS days_14,
COUNT(DISTINCT CASE WHEN period = 30 THEN id END) AS days_30
FROM (
SELECT
dates.day AS DAY,
periods.period AS period,
id
FROM data AS activity
CROSS JOIN (SELECT DAY FROM data GROUP BY DAY) AS dates
CROSS JOIN (SELECT period FROM (SELECT 7 AS period UNION ALL
SELECT 14 AS period UNION ALL SELECT 30 AS period)) AS periods
WHERE dates.day >= activity.day
AND CAST(DATE_DIFF(dates.day, activity.day, DAY) / periods.period AS INT64) = 0
GROUP BY 1,2,3
)
GROUP BY DAY
ORDER BY DAY
Does it work for you?
WITH dates AS(
SELECT GENERATE_DATE_ARRAY(DATE_SUB(CURRENT_DATE(), INTERVAL 4 MONTH), CURRENT_DATE()) arr_dates
),
data AS(
SELECT 1 id, '2017-03-14' dt UNION ALL
SELECT 1 id, '2017-03-14' dt UNION ALL
SELECT 1, '2017-04-20' UNION ALL
SELECT 2, '2017-04-20' UNION ALL
SELECT 3, '2017-03-15' UNION ALL
SELECT 4, '2017-04-20' UNION ALL
SELECT 5, '2017-07-14'
)
SELECT
i_date date,
DATE_ADD(i_date, INTERVAL 4 MONTH) next_date,
(SELECT COUNT(DISTINCT id) FROM data WHERE PARSE_DATE("%Y-%m-%d", data.dt) BETWEEN i_date AND DATE_ADD(i_date, INTERVAL 4 MONTH)) total
FROM dates,
UNNEST(arr_dates) i_date
ORDER BY i_date
Where data
is a simulation of your events
table.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With