We have a huge event table that records users registering and playing our games.
Now I want to determine second-day retention for each day, i.e. the percentage of players who register on a given day and also play on the following day.
Assume we have three fields:
timestamp ts
int  userId
int  eventId               (e.g. 1 = Register, 2 = Login)
How is this done in BigQuery syntax? I would like the following output:
Date         Register    Logins day after    % Second day retention
2013-08-23   25 563      4 567               17.8
I have failed with subselects and joins but it must be doable!
How about this query with public data:
SELECT
  a.day, first_day, return_next_day,
  integer((return_next_day / first_day) * 100) percent
FROM (
  SELECT COUNT(DISTINCT actor, 50000) first_day,
    STRFTIME_UTC_USEC(
      UTC_USEC_TO_DAY(PARSE_UTC_USEC(created_at)), "%Y-%m-%d") day,
  FROM
    [publicdata:samples.github_timeline]
  GROUP BY day) a
JOIN (
  SELECT
    COUNT(*) return_next_day, day
  FROM (
    SELECT
      a.day day, a.actor, b.day, b.actor
    FROM (
      SELECT
        STRFTIME_UTC_USEC(
          UTC_USEC_TO_DAY(PARSE_UTC_USEC(created_at)), "%Y-%m-%d") day,
        MAX(STRFTIME_UTC_USEC(86400000000 + UTC_USEC_TO_DAY(
          PARSE_UTC_USEC(created_at)), "%Y-%m-%d")) dayplus,
        actor
      FROM
        [publicdata:samples.github_timeline]
      GROUP EACH BY actor, day) a
    JOIN EACH (
      SELECT
        STRFTIME_UTC_USEC(
          UTC_USEC_TO_DAY(PARSE_UTC_USEC(created_at)), "%Y-%m-%d") day,
        actor
      FROM
        [publicdata:samples.github_timeline]
      GROUP EACH BY actor, day) b
      ON a.actor = b.actor
      AND a.dayplus = b.day
      )
  GROUP BY day) b
  ON a.day = b.day
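This gives me the desired results: one row per day with the number of distinct actors seen that day, how many of them came back the next day, and the percentage.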

Note the query uses STRFTIME_UTC_USEC(UTC_USEC_TO_DAY(PARSE_UTC_USEC(created_at)), "%Y-%m-%d") day many times, to convert the source string data to a date. If I owned the data, I would run an ETL over the table beforehand, to skip this repetitive step.
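As a rough illustration of that ETL idea, here is a minimal sketch in Python using the standard-library sqlite3 module rather than BigQuery. The table names timeline and actor_days, the sample rows, and the ISO-style timestamp strings are all assumptions made for the example; the point is only that the day is derived once and stored, so later queries do not repeat the conversion.

```python
import sqlite3

# In-memory stand-in for the source table (hypothetical sample data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE timeline (created_at TEXT, actor TEXT)")
conn.executemany(
    "INSERT INTO timeline VALUES (?, ?)",
    [
        ("2013-08-23 09:15:00", "alice"),
        ("2013-08-23 17:40:00", "alice"),
        ("2013-08-23 11:00:00", "bob"),
        ("2013-08-24 08:05:00", "alice"),
    ],
)

# One-off preprocessing step: derive the day once and keep a single
# row per (day, actor), so later queries can join on "day" directly.
conn.execute(
    """
    CREATE TABLE actor_days AS
    SELECT DISTINCT date(created_at) AS day, actor
    FROM timeline
    """
)

actor_days = conn.execute(
    "SELECT day, actor FROM actor_days ORDER BY day, actor"
).fetchall()
print(actor_days)
```

In BigQuery the equivalent would be to run the conversion once and save the result to a destination table, then point the retention query at that table.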
The query joins two subqueries:
The first counts how many distinct actors were present on a specific date. Note the second parameter to COUNT(DISTINCT ...): it raises the threshold below which the count is exact rather than approximate.
The second joins each day with the next day, keeping only rows where the same actor is present on both days. Counting those rows per day gives the number of actors who were present on a given day and again on the next day.
Joining the two subqueries gives you both counts side by side, so you can divide one by the other.
This is only one of many possible approaches, and the query could be optimized further.
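To show the same join-on-next-day idea against the schema from the question (ts, userId, eventId), here is a small sketch in Python with the standard-library sqlite3 module. It is not BigQuery syntax: the table name events, the sample rows, and the SQLite date functions are assumptions for illustration only. It also uses a LEFT JOIN, unlike the query above, so that days on which nobody returned still appear with a count of zero.

```python
import sqlite3

REGISTER, LOGIN = 1, 2  # eventId values as described in the question

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts TEXT, userId INTEGER, eventId INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("2013-08-23 10:00:00", 1, REGISTER),
        ("2013-08-23 11:00:00", 2, REGISTER),
        ("2013-08-23 12:00:00", 3, REGISTER),
        ("2013-08-23 12:30:00", 4, REGISTER),
        ("2013-08-24 09:00:00", 1, LOGIN),
        ("2013-08-24 21:00:00", 1, LOGIN),  # same user, same day: counted once
        ("2013-08-25 09:00:00", 2, LOGIN),  # two days after registering: not counted
        ("2013-08-24 13:00:00", 5, REGISTER),
        ("2013-08-25 08:00:00", 5, LOGIN),
    ],
)

query = """
SELECT
  r.day AS day,
  COUNT(DISTINCT r.userId) AS registered,
  COUNT(DISTINCT l.userId) AS logins_day_after,
  ROUND(100.0 * COUNT(DISTINCT l.userId) / COUNT(DISTINCT r.userId), 1) AS pct
FROM (
  -- registrations, with the following day precomputed as "dayplus"
  SELECT DISTINCT date(ts) AS day, date(ts, '+1 day') AS dayplus, userId
  FROM events
  WHERE eventId = ?
) AS r
LEFT JOIN (
  -- one row per user per day on which they logged in
  SELECT DISTINCT date(ts) AS day, userId
  FROM events
  WHERE eventId = ?
) AS l
  ON l.userId = r.userId AND l.day = r.dayplus
GROUP BY r.day
ORDER BY r.day
"""

rows = conn.execute(query, (REGISTER, LOGIN)).fetchall()
for row in rows:
    print(row)
```

With this sample data, four users register on 2013-08-23 and one of them logs in the next day, so that day's retention is 25%. The structure mirrors the answer's query: one side carries the day plus one, the other side carries the actual day, and the join matches them per user.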