I'm calculating the average time to get a reply on Stack Overflow, and the results make no sense.
#standardSQL
WITH question_answers AS (
SELECT *
, timestamp_diff(answers.first, creation_date, minute) minutes
FROM (
SELECT creation_date
, (SELECT AS STRUCT MIN(creation_date) first, COUNT(*) c
FROM `bigquery-public-data.stackoverflow.posts_answers` b
WHERE a.id=b.parent_id
) answers
, SPLIT(tags, '|') tags
FROM `bigquery-public-data.stackoverflow.posts_questions` a
WHERE EXTRACT(year FROM creation_date) > 2015
), UNNEST(tags) tag
WHERE tag IN ('java', 'javascript', 'google-bigquery', 'firebase', 'php')
AND answers.c > 0
)
SELECT tag
, COUNT(*) questions
, ROUND(AVG(minutes), 2) first_reply_avg_minutes
FROM question_answers
GROUP BY tag
How should I calculate the average time?
Unfortunately, BigQuery does not have a straight forward median calculation. We need to calculate Median as an analytical function here (as opposed to using aggregate function for mean).
The geometric mean: EXP (AVG (LOG (GREATEST (minutes,1)))) The mean after removing outliers: AVG (q) FROM (SELECT q FROM QUANTILES (q, 100) LIMIT 80 OFFSET 2)). The median: all_minutes [OFFSET (CAST (ARRAY_LENGTH (all_minutes)/2 AS INT64))]
To calculate the percentile at 50%, BigQuery has 2 functions: PERCENTILE_CONT (x, 0.5) and PERCENTILE_DISC (x, 0.5) (for integer outputs). Note that we have to use OVER clause to calculate the percentile by year, as it is an analytic function.
The median you've used is discrete, not interpolated/continuous so could be lower than the real middle but might not be significant, probably better to use PERCENTILE_CONT (x, 0.5) OVER () AS median in standardSQL see cloud.google.com/bigquery/docs/reference/standard-sql/…
Update 2019: How about sharing some persisted public UDFs?
First one, median:
SELECT fhoffa.x.median([1,1,1,2,3,4,5,100,1000])
3.0
Indeed - getting an average time to get answers on Stack Overflow of above 100 hours (>6000 minutes) seems wrong - and is largely driven by outliers.
Instead of doing a simple AVG()
you could get:
EXP(AVG(LOG(GREATEST(minutes,1))))
AVG(q) FROM (SELECT q FROM QUANTILES(q, 100) LIMIT 80 OFFSET 2))
.all_minutes[OFFSET(CAST(ARRAY_LENGTH(all_minutes)/2 AS INT64))]
The results make way more sense if you use any of those alternatives:
As you can see here, in this case removing outliers give us results similar to the geometric mean - while the median reports even lower numbers. Which one to use? Your choice.
WITH question_answers AS (
SELECT *
, timestamp_diff(answers.first, creation_date, minute) minutes
FROM (
SELECT creation_date
, (SELECT AS STRUCT MIN(creation_date) first, COUNT(*) c
FROM `bigquery-public-data.stackoverflow.posts_answers` b
WHERE a.id=b.parent_id
) answers
, SPLIT(tags, '|') tags
FROM `bigquery-public-data.stackoverflow.posts_questions` a
WHERE EXTRACT(year FROM creation_date) > 2015
), UNNEST(tags) tag
WHERE tag IN ('java', 'javascript', 'google-bigquery', 'firebase', 'php', 'sql', 'elasticsearch', 'apache-kafka', 'tensorflow')
AND answers.c > 0
)
SELECT * EXCEPT(qs, all_minutes)
, (SELECT ROUND(AVG(q),2) FROM (SELECT q FROM UNNEST(qs) q ORDER BY q LIMIT 80 OFFSET 2)) avg_no_outliers
, all_minutes[OFFSET(CAST(ARRAY_LENGTH(all_minutes)/2 AS INT64) )] median_minutes
FROM (
SELECT tag
, COUNT(*) questions
, ROUND(AVG(minutes), 2) avg_minutes
, ROUND(EXP(AVG(LOG(GREATEST(minutes,1)))),2) first_reply_avg_minutes_geom
, APPROX_QUANTILES(minutes, 100) qs
, ARRAY_AGG(minutes IGNORE NULLS ORDER BY minutes) all_minutes
FROM question_answers
GROUP BY tag
)
ORDER BY 2 DESC
Bonus MEDIAN()
UDF function from Elliott.
CREATE TEMP FUNCTION MEDIAN(arr ANY TYPE) AS ((
SELECT
IF(
MOD(ARRAY_LENGTH(arr), 2) = 0,
(arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2) - 1)] + arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2))]) / 2,
arr[OFFSET(DIV(ARRAY_LENGTH(arr), 2))]
)
FROM (SELECT ARRAY_AGG(x ORDER BY x) AS arr FROM UNNEST(arr) AS x)
));
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With