We have a huge event table with users registering and playing our games.
Now I want to determine second-day retention for each day, that is, the percentage of players who register on a given day and also play the day after.
Assume we have three fields:
timestamp ts
int userId
int eventId (e.g. 1 = Register, 2 = Login)
How is this done in BigQuery syntax? I would like the following output:
Date         Register   Logins day after   % Second day retention
2013-08-23   25 563     4 567              17.8
I have failed with subselects and joins but it must be doable!
How about this query with public data:
SELECT
  a.day, first_day, return_next_day,
  integer((return_next_day / first_day) * 100) percent
FROM (
  SELECT COUNT(DISTINCT actor, 50000) first_day,
    STRFTIME_UTC_USEC(
      UTC_USEC_TO_DAY(PARSE_UTC_USEC(created_at)), "%Y-%m-%d") day,
  FROM
    [publicdata:samples.github_timeline]
  GROUP BY day) a
JOIN (
  SELECT
    COUNT(*) return_next_day, day
  FROM (
    SELECT
      a.day day, a.actor, b.day, b.actor
    FROM (
      SELECT
        STRFTIME_UTC_USEC(
          UTC_USEC_TO_DAY(PARSE_UTC_USEC(created_at)), "%Y-%m-%d") day,
        MAX(STRFTIME_UTC_USEC(86400000000 + UTC_USEC_TO_DAY(
          PARSE_UTC_USEC(created_at)), "%Y-%m-%d")) dayplus,
        actor
      FROM
        [publicdata:samples.github_timeline]
      GROUP EACH BY actor, day) a
    JOIN EACH (
      SELECT
        STRFTIME_UTC_USEC(
          UTC_USEC_TO_DAY(PARSE_UTC_USEC(created_at)), "%Y-%m-%d") day,
        actor
      FROM
        [publicdata:samples.github_timeline]
      GROUP EACH BY actor, day) b
    ON a.actor = b.actor
      AND a.dayplus = b.day
  )
  GROUP BY day) b
ON a.day = b.day
This gives me the desired results.
Note that the query repeats STRFTIME_UTC_USEC(UTC_USEC_TO_DAY(PARSE_UTC_USEC(created_at)), "%Y-%m-%d") many times to convert the source string data to a day. If I owned the data, I would run an ETL over the table beforehand to skip this repetitive step.
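As a rough illustration of what such an ETL step could precompute, here is a minimal Python sketch that derives the day and the following day once per row. The function name day_and_dayplus and the 'YYYY-MM-DD HH:MM:SS' UTC input format are assumptions made for the example, not properties of the public dataset.

```python
from datetime import datetime, timedelta
from typing import Tuple


def day_and_dayplus(created_at: str) -> Tuple[str, str]:
    """Return (day, dayplus) as 'YYYY-MM-DD' strings.

    Assumes created_at looks like 'YYYY-MM-DD HH:MM:SS' and is already in UTC.
    """
    ts = datetime.strptime(created_at, "%Y-%m-%d %H:%M:%S")
    day = ts.date()                      # truncate to the day, like UTC_USEC_TO_DAY
    dayplus = day + timedelta(days=1)    # the query adds 86400000000 usec instead
    return day.isoformat(), dayplus.isoformat()
```

Storing both values as columns would let the retention query group and join on them directly.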
The query joins two subqueries:
The first counts how many distinct 'actors' were present on a specific date. Note the second parameter to COUNT(DISTINCT ...): in BigQuery's legacy SQL, COUNT(DISTINCT) is only a statistical approximation above a threshold (1000 by default), and the second parameter raises that threshold, here to 50000, so that counts below it are exact.
The second JOINs each day with the next day, keeping only actors that are present on both days. You can then count how many actors were present on a given day and on the next day.
Joining both subqueries gets you both counts, and you can proceed to divide.
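To make the logic of the join easier to follow, here is a small Python sketch of the same idea on toy data. It is not the BigQuery query itself; the function name next_day_retention and the sample actors are invented for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta


def next_day_retention(events):
    """events: iterable of (day, actor) pairs, with day as a datetime.date.

    Returns {day: (first_day, return_next_day, percent)}.
    """
    # First count: the set of distinct actors seen on each day.
    actors_by_day = defaultdict(set)
    for day, actor in events:
        actors_by_day[day].add(actor)

    result = {}
    for day, actors in actors_by_day.items():
        # Second count: actors present on this day and on the following day.
        next_actors = actors_by_day.get(day + timedelta(days=1), set())
        returned = actors & next_actors
        # Integer percentage, truncated like integer(...) in the query.
        percent = len(returned) * 100 // len(actors)
        result[day] = (len(actors), len(returned), percent)
    return result


events = [
    (date(2013, 8, 23), "alice"), (date(2013, 8, 23), "bob"),
    (date(2013, 8, 23), "alice"),  # duplicate event, counted once
    (date(2013, 8, 24), "alice"), (date(2013, 8, 24), "carol"),
    (date(2013, 8, 25), "carol"),
]
retention = next_day_retention(events)
```

One difference from the query: the inner JOIN drops days on which nobody returned, while this sketch keeps them with a count of 0.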
This is only one of many possible approaches, and the query could be optimized further.
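One variation, closer to the schema in the question, is to count only Register events on the first day and only Login events on the next day. The sketch below tries that idea from Python against an in-memory SQLite database. It uses SQLite's date functions rather than BigQuery syntax, and the table name events and the function name second_day_retention are made up for the example.

```python
import sqlite3

# eventId 1 = Register, 2 = Login, as in the question.
RETENTION_SQL = """
SELECT r.day AS day,
       COUNT(DISTINCT r.userId) AS registered,
       COUNT(DISTINCT l.userId) AS returned
FROM (SELECT DISTINCT date(ts) AS day, userId
      FROM events WHERE eventId = 1) AS r
LEFT JOIN (SELECT DISTINCT date(ts) AS day, userId
           FROM events WHERE eventId = 2) AS l
  ON l.userId = r.userId
 AND l.day = date(r.day, '+1 day')
GROUP BY r.day
ORDER BY r.day
"""


def second_day_retention(rows):
    """rows: iterable of (ts, userId, eventId), ts as 'YYYY-MM-DD HH:MM:SS'.

    Returns a list of (day, registered, returned, percent) tuples.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE events (ts TEXT, userId INTEGER, eventId INTEGER)")
        conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
        return [
            (day, registered, returned,
             round(100.0 * returned / registered, 1))
            for day, registered, returned in conn.execute(RETENTION_SQL)
        ]
    finally:
        conn.close()
```

The LEFT JOIN keeps registration days with no returning players, so they show up with 0 percent instead of disappearing from the output.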