Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

BigQuery SQL: Average, geometric mean, remove outliers, median

I'm calculating the average time to get a reply on Stack Overflow, and the results make no sense.

#standardSQL

WITH question_answers AS (
  SELECT * 
    , timestamp_diff(answers.first, creation_date, minute) minutes
  FROM (
    SELECT creation_date
      , (SELECT AS STRUCT MIN(creation_date) first, COUNT(*) c
         FROM `bigquery-public-data.stackoverflow.posts_answers` b
         WHERE a.id=b.parent_id
        ) answers
      , SPLIT(tags, '|') tags
    FROM `bigquery-public-data.stackoverflow.posts_questions` a
    WHERE EXTRACT(year FROM creation_date) > 2015
  ), UNNEST(tags) tag
  WHERE tag IN ('java', 'javascript', 'google-bigquery', 'firebase', 'php')
  AND answers.c > 0
)

SELECT tag
  , COUNT(*) questions
  , ROUND(AVG(minutes), 2) first_reply_avg_minutes
FROM question_answers
GROUP BY tag

enter image description here

How should I calculate the average time?

like image 542
Felipe Hoffa Avatar asked Aug 23 '18 08:08

Felipe Hoffa


People also ask

How to calculate the median in BigQuery?

Unfortunately, BigQuery does not have a straight forward median calculation. We need to calculate Median as an analytical function here (as opposed to using aggregate function for mean).

How do you find the geometric mean after removing outliers?

The geometric mean: EXP (AVG (LOG (GREATEST (minutes,1)))) The mean after removing outliers: AVG (q) FROM (SELECT q FROM QUANTILES (q, 100) LIMIT 80 OFFSET 2)). The median: all_minutes [OFFSET (CAST (ARRAY_LENGTH (all_minutes)/2 AS INT64))]

How do you find the percentile in BigQuery?

To calculate the percentile at 50%, BigQuery has 2 functions: PERCENTILE_CONT (x, 0.5) and PERCENTILE_DISC (x, 0.5) (for integer outputs). Note that we have to use OVER clause to calculate the percentile by year, as it is an analytic function.

What is the best way to find the median in SQL?

The median you've used is discrete, not interpolated/continuous so could be lower than the real middle but might not be significant, probably better to use PERCENTILE_CONT (x, 0.5) OVER () AS median in standardSQL see cloud.google.com/bigquery/docs/reference/standard-sql/…


1 Answers

Update 2019: How about sharing some persisted public UDFs?

First one, median:

SELECT fhoffa.x.median([1,1,1,2,3,4,5,100,1000]) 

3.0

Indeed - getting an average time to get answers on Stack Overflow of above 100 hours (>6000 minutes) seems wrong - and is largely driven by outliers.

Instead of doing a simple AVG() you could get:

  • The geometric mean: EXP(AVG(LOG(GREATEST(minutes,1))))
  • The mean after removing outliers: AVG(q) FROM (SELECT q FROM QUANTILES(q, 100) LIMIT 80 OFFSET 2)).
  • The median: all_minutes[OFFSET(CAST(ARRAY_LENGTH(all_minutes)/2 AS INT64))]

The results make way more sense if you use any of those alternatives:

enter image description here

As you can see here, in this case removing outliers give us results similar to the geometric mean - while the median reports even lower numbers. Which one to use? Your choice.

WITH question_answers AS (
  SELECT * 
    , timestamp_diff(answers.first, creation_date, minute) minutes
  FROM (
    SELECT creation_date
      , (SELECT AS STRUCT MIN(creation_date) first, COUNT(*) c
         FROM `bigquery-public-data.stackoverflow.posts_answers` b
         WHERE a.id=b.parent_id
        ) answers
      , SPLIT(tags, '|') tags
    FROM `bigquery-public-data.stackoverflow.posts_questions` a
    WHERE EXTRACT(year FROM creation_date) > 2015
  ), UNNEST(tags) tag
  WHERE tag IN ('java', 'javascript', 'google-bigquery', 'firebase', 'php', 'sql', 'elasticsearch', 'apache-kafka', 'tensorflow')
  AND answers.c > 0
)

SELECT *  EXCEPT(qs, all_minutes)
  , (SELECT ROUND(AVG(q),2) FROM (SELECT q FROM UNNEST(qs) q ORDER BY q LIMIT 80 OFFSET 2)) avg_no_outliers 
  , all_minutes[OFFSET(CAST(ARRAY_LENGTH(all_minutes)/2 AS INT64)  )] median_minutes
FROM (
  SELECT tag
    , COUNT(*) questions
    , ROUND(AVG(minutes), 2) avg_minutes
    , ROUND(EXP(AVG(LOG(GREATEST(minutes,1)))),2) first_reply_avg_minutes_geom
    , APPROX_QUANTILES(minutes, 100) qs
    , ARRAY_AGG(minutes IGNORE NULLS ORDER BY minutes) all_minutes
  FROM question_answers
  GROUP BY tag
)

ORDER BY 2 DESC

Bonus MEDIAN() UDF function from Elliott.

CREATE TEMP FUNCTION MEDIAN(arr ANY TYPE) AS ((
  SELECT
    IF(
      MOD(ARRAY_LENGTH(arr), 2) = 0,
      (arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2) - 1)] + arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2))]) / 2,
      arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2))]
    )
  FROM (SELECT ARRAY_AGG(x ORDER BY x) AS arr FROM UNNEST(arr) AS x)
));
like image 113
Felipe Hoffa Avatar answered Sep 28 '22 22:09

Felipe Hoffa