Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Firebase exported to BigQuery: retention cohorts query

Firebase offer split testing functionality through Firebase remote configuration, but there are lack of ability to filter retention in cohorts sections with user properties (with any property in actual fact).

In quest of solution for this problem i'm looking for BigQuery, in reason of Firebase Analytics provide usable way to export data to this service.

But i stuck with many questions and google has no answer or example which may point me to the right direction.

General questions:

As first step i need to aggregate data which represent same data firebase cohorts do, so i can be sure my calculation is right:

firebase cohorts

Next step should be just apply constrains to the queries, so they match custom user properties.

Here what i get so far:

enter image description here

The main problem – big difference in users calculations. Sometimes it is about 100 users, but sometimes close to 1000.

This is approach i use:

# 1

# Count users with `user_dim.first_open_timestamp_micros` 
# in specified period (w0 – week 1)
# this is the way firebase group users to cohorts 
# (who started app on the same day or during the same week) 
# https://support.google.com/firebase/answer/6317510

SELECT
  COUNT(DISTINCT user_dim.app_info.app_instance_id) as count
FROM
  (
   TABLE_DATE_RANGE
    (
     [admob-app-id-xx:xx_IOS.app_events_], 
     TIMESTAMP('2016-11-20'), 
     TIMESTAMP('2016-11-26')
    )
  )
WHERE
  STRFTIME_UTC_USEC(user_dim.first_open_timestamp_micros, '%Y-%m-%d')
  BETWEEN '2016-11-20' AND '2016-11-26'

# 2

# For each next period count events with 
# same first_open_timestamp
# Here is example for one of the weeks. 
# week 0 is Nov20-Nov26, week 1 is Nov27-Dec03

SELECT
  COUNT(DISTINCT user_dim.app_info.app_instance_id) as count
FROM
  (
   TABLE_DATE_RANGE
    (
     [admob-app-id-xx:xx_IOS.app_events_], 
     TIMESTAMP('2016-11-27'), 
     TIMESTAMP('2016-12-03')
    )
  )
WHERE
  STRFTIME_UTC_USEC(user_dim.first_open_timestamp_micros, '%Y-%m-%d')
  BETWEEN '2016-11-20' AND '2016-11-26'

# 3

# Now we have users for each week w1, w2, ... w5
# Calculate retention for each of them
# retention week 1 = w1 / w0 * 100 = 25.72181359
# rw2 = w2 / w1 * 100
# ...
# rw5 = w5 / w1 * 100

# 4 

# Shift week 0 by one and repeat from step 1

BigQuery queries tips request

Any tips and directions to go about building complex query which may aggregate and calculate all data required for this task in one step is very appreciated.

Here is BigQuery Export schema if needed

Side questions:

  • why all the user_dim.device_info.device_id and user_dim.device_info.resettable_device_idis null?
  • user_dim.app_info.app_id is missing from the doc (if firebase support teammate will be read this question)
  • how event_dim.timestamp_micros and event_dim.previous_timestamp_micros should be used, i can not get their purpose.

PS

It will be good someone from Firebase teammate answer this question. Five month ago there are was one mention about extending cohorts functionality with filtering or show bigqueries examples, but things are not moving. Firebase Analytics is way to go they said, Google Analytics is deprecated, they said. Now i spend second day to lean bigquery and build my own solution over the existing analytics tools. I no, stack overflow is not the place for this comments, but guys are you thinking? Split testing may grammatically affect retention of my app. My app does not sold anything, funnels and events is not valuable metrics in many cases.

like image 224
George Avatar asked Jan 06 '17 16:01

George


Video Answer


1 Answers

Any tips and directions to go about building complex query which may aggregate and calculate all data required for this task in one step is very appreciated.

yes, generic bigquery will work fine

Below is not the most generic version, but can give you an idea
In this example I am using Stack Overflow Data available in Google BigQuery Public Datasets

First sub-select – activities – in most cases the only what you need to re-write to reflect specifics of your data.
What it does is:
a. Defines period you want to set for analysis.
In example below - it is a month - FORMAT_DATE('%Y-%m', ...
But you can use year, week, day or anything else – respectively
• By year - FORMAT_DATE('%Y', DATE(answers.creation_date)) AS period
• By week - FORMAT_DATE('%Y-%W', DATE(answers.creation_date)) AS period
• By day - FORMAT_DATE('%Y-%m-%d', DATE(answers.creation_date)) AS period
• …
b. Also it “filters” only the type of events/activity you need to analyse
for example, `WHERE CONCAT('|', questions.tags, '|') LIKE '%|google-bigquery|%' looks for answers for google-bigquery tagged question

The rest of sub-queries are more-less generic and mostly can be used as is

#standardSQL
WITH activities AS (
  SELECT answers.owner_user_id AS id,
    FORMAT_DATE('%Y-%m', DATE(answers.creation_date)) AS period
  FROM `bigquery-public-data.stackoverflow.posts_answers` AS answers
  JOIN `bigquery-public-data.stackoverflow.posts_questions` AS questions
  ON questions.id = answers.parent_id
  WHERE CONCAT('|', questions.tags, '|') LIKE '%|google-bigquery|%' 
  GROUP BY id, period
), cohorts AS (
  SELECT id, MIN(period) AS cohort FROM activities GROUP BY id
), periods AS (
  SELECT period, ROW_NUMBER() OVER(ORDER BY period) AS num
  FROM (SELECT DISTINCT cohort AS period FROM cohorts)
), cohorts_size AS (
  SELECT cohort, periods.num AS num, COUNT(DISTINCT activities.id) AS ids 
  FROM cohorts JOIN activities ON activities.period = cohorts.cohort AND cohorts.id = activities.id
  JOIN periods ON periods.period = cohorts.cohort
  GROUP BY cohort, num
), retention AS (
  SELECT cohort, activities.period AS period, periods.num AS num, COUNT(DISTINCT cohorts.id) AS ids
  FROM periods JOIN activities ON activities.period = periods.period
  JOIN cohorts ON cohorts.id = activities.id 
  GROUP BY cohort, period, num 
)
SELECT 
  CONCAT(cohorts_size.cohort, ' - ',  FORMAT("%'d", cohorts_size.ids), ' users') AS cohort, 
  retention.num - cohorts_size.num AS period_lag, 
  retention.period as period_label,
  ROUND(retention.ids / cohorts_size.ids * 100, 2) AS retention , retention.ids AS rids
FROM retention
JOIN cohorts_size ON cohorts_size.cohort = retention.cohort
WHERE cohorts_size.cohort >= FORMAT_DATE('%Y-%m', DATE('2015-01-01'))
ORDER BY cohort, period_lag, period_label  

You can visualize result of above query with the tool of your choice
Note: you can use either period_lag or period_label
See the difference of their use in below examples

with period_lag

enter image description here

with period_label

enter image description here

like image 173
Mikhail Berlyant Avatar answered Oct 08 '22 14:10

Mikhail Berlyant