
How to do subselects in BigQuery

We have a huge event table with users registering and playing our games.

Now I want to determine second-day retention for each day, that is, the percentage of players who register on a given day and also play on the following day.

So assume we have three fields

timestamp ts
int  userId
int  eventId               (I.e. 1 = Register, 2 = Login)

How is this done in BigQuery syntax? I would like the following output:

Date         Register    Logins day after    % Second day retention
2013-08-23   25 563      4 567               17.8

I have failed with subselects and joins but it must be doable!

Gunnar Eketrapp asked Dec 26 '22 21:12


1 Answer

How about this query with public data:

SELECT
  a.day, first_day, return_next_day,
  integer((return_next_day / first_day) * 100) percent
FROM (
  SELECT COUNT(DISTINCT actor, 50000) first_day,
    STRFTIME_UTC_USEC(
      UTC_USEC_TO_DAY(PARSE_UTC_USEC(created_at)), "%Y-%m-%d") day,
  FROM
    [publicdata:samples.github_timeline]
  GROUP BY day) a
JOIN (
  SELECT
    COUNT(*) return_next_day, day
  FROM (
    SELECT
      a.day day, a.actor, b.day, b.actor
    FROM (
      SELECT
        STRFTIME_UTC_USEC(
          UTC_USEC_TO_DAY(PARSE_UTC_USEC(created_at)), "%Y-%m-%d") day,
        MAX(STRFTIME_UTC_USEC(86400000000 + UTC_USEC_TO_DAY(
          PARSE_UTC_USEC(created_at)), "%Y-%m-%d")) dayplus,
        actor
      FROM
        [publicdata:samples.github_timeline]
      GROUP EACH BY actor, day) a
    JOIN EACH (
      SELECT
        STRFTIME_UTC_USEC(
          UTC_USEC_TO_DAY(PARSE_UTC_USEC(created_at)), "%Y-%m-%d") day,
        actor
      FROM
        [publicdata:samples.github_timeline]
      GROUP EACH BY actor, day) b
      ON a.actor = b.actor
      AND a.dayplus = b.day
      )
  GROUP BY day) b
  ON a.day = b.day

This gives me the desired results:

[Image: results for the query]

Note that the query repeats STRFTIME_UTC_USEC(UTC_USEC_TO_DAY(PARSE_UTC_USEC(created_at)), "%Y-%m-%d") day several times to convert the source string data to a date. If I owned the data, I would run an ETL over the table beforehand to skip this repetitive step.
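To make the day / dayplus expressions concrete, here is a small Python sketch of the same arithmetic. It is not BigQuery code: the function names are hypothetical, and it assumes the timestamp has already been parsed into UTC microseconds (the job PARSE_UTC_USEC does in the query). It truncates the timestamp to the start of its day, adds the same 86400000000 microseconds the query adds, and formats both values as "%Y-%m-%d".

```python
from datetime import datetime, timezone

# Microseconds in one day; the same constant the query adds to build dayplus.
USEC_PER_DAY = 86_400_000_000


def usec_to_day(usec):
    """Truncate a UTC microsecond timestamp to the start of its day."""
    return usec - usec % USEC_PER_DAY


def format_day(usec):
    """Format a UTC microsecond timestamp as YYYY-MM-DD."""
    seconds = usec // 1_000_000
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%d")


def day_and_dayplus(usec):
    """Return (day, dayplus) strings for a UTC microsecond timestamp."""
    start = usec_to_day(usec)
    return format_day(start), format_day(start + USEC_PER_DAY)
```

Precomputing these two strings once per row is roughly what the ETL step mentioned above would do, so the join can compare plain columns instead of re-deriving them in every subquery.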

The query joins two subqueries:

  • The first subquery counts how many distinct 'actors' were present on a specific date. Note the second parameter to COUNT(DISTINCT ...): in BigQuery's legacy SQL the distinct count is exact only up to that threshold and a statistical approximation above it, so the large value (50000) keeps the count precise.

  • The second subquery JOINs a given day with the next day, on the condition that the same actor is present on both days. Then you can count how many actors were present on a given day and also on the next day.

  • Joining both subqueries gives you both counts, and you can proceed to divide.

This is only one of many possible approaches, and the query can be optimized further.
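As a way to check the logic on the schema from the question (ts, userId, eventId), the same three steps can be tried locally on toy data. The following is only a sketch under stated assumptions: it runs SQLite through Python rather than BigQuery, so the date functions differ from BigQuery's; the table name events and the sample rows are made up for illustration; and it counts registrations (eventId = 1) against logins on the following day (eventId = 2), rather than any activity as in the github_timeline query.

```python
import sqlite3

# Hypothetical sample rows: (ts, userId, eventId), 1 = Register, 2 = Login.
EVENTS = [
    ("2013-08-23 09:00:00", 1, 1),
    ("2013-08-23 10:00:00", 2, 1),
    ("2013-08-23 11:00:00", 3, 1),
    ("2013-08-23 12:00:00", 4, 1),
    ("2013-08-24 08:00:00", 1, 2),  # user 1 returns the next day
    ("2013-08-24 09:00:00", 1, 2),  # second login the same day, counted once
    ("2013-08-25 09:00:00", 2, 2),  # user 2 returns two days later: not counted
    ("2013-08-24 10:00:00", 5, 1),
    ("2013-08-25 10:00:00", 5, 2),  # user 5 returns the next day
]

# SQLite dialect (an assumption of this sketch, not BigQuery syntax).
QUERY = """
SELECT
  r.day AS day,
  COUNT(DISTINCT r.userId) AS registered,
  COUNT(DISTINCT l.userId) AS logins_day_after,
  ROUND(100.0 * COUNT(DISTINCT l.userId) / COUNT(DISTINCT r.userId), 1) AS pct
FROM (SELECT DISTINCT userId, date(ts) AS day
      FROM events WHERE eventId = 1) AS r
LEFT JOIN (SELECT DISTINCT userId, date(ts) AS day
           FROM events WHERE eventId = 2) AS l
  ON l.userId = r.userId
 AND l.day = date(r.day, '+1 day')   -- login must fall on the following day
GROUP BY r.day
ORDER BY r.day
"""


def second_day_retention(events):
    """Return (day, registered, logins_day_after, pct) rows for the events."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (ts TEXT, userId INTEGER, eventId INTEGER)")
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", events)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in second_day_retention(EVENTS):
        print(row)
```

The LEFT JOIN keeps days on which nobody returned (they show 0 logins), whereas the inner JOIN in the query above drops such days from the output.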

Felipe Hoffa answered Jan 05 '23 16:01