
How to calculate median of a numeric sequence in Google BigQuery efficiently?

I need to calculate the median of a numeric sequence in Google BigQuery efficiently. Is this possible?

Manish Agrawal asked Mar 17 '15 06:03




1 Answer

2018 update with more metrics:

BigQuery SQL: Average, geometric mean, remove outliers, median


For my own reference, here are working queries against the taxi data:

Approximate quantiles (legacy SQL):

SELECT MONTH(pickup_datetime) month, NTH(51, QUANTILES(tip_amount,101)) median
FROM [nyc-tlc:green.trips_2015]
WHERE tip_amount > 0
GROUP BY 1
ORDER BY 1

This gives the same results as PERCENTILE_DISC (legacy SQL):

SELECT month, FIRST(median) median
FROM (
  SELECT MONTH(pickup_datetime) month, tip_amount, PERCENTILE_DISC(0.5) OVER(PARTITION BY month ORDER BY tip_amount) median
  FROM [nyc-tlc:green.trips_2015]
  WHERE tip_amount > 0
)
GROUP BY 1
ORDER BY 1
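To make clear what PERCENTILE_DISC(0.5) picks, here is a rough Python sketch of the rule as I understand it: return the first sorted value whose cumulative distribution is at least the requested percentile. The function name `percentile_disc` and the sample list `tips` are made up for illustration, and this is not how BigQuery computes it internally.

```python
import math
import statistics

def percentile_disc(values, p):
    """Return the first sorted value whose cumulative distribution is >= p."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    n = len(ordered)
    # The item at 0-based index i covers a fraction (i + 1) / n of the rows,
    # so the first index with (i + 1) / n >= p is ceil(p * n) - 1.
    index = max(math.ceil(p * n) - 1, 0)
    return ordered[index]

tips = [1.0, 2.0, 3.0, 10.0]  # hypothetical tip amounts, even row count

print(percentile_disc(tips, 0.5))   # always an actual value from the data
print(statistics.median_low(tips))  # the lower of the two middle values
print(statistics.median(tips))      # the interpolated median, for comparison
```

With an even number of rows the discrete result is the lower of the two middle values, so it can differ from an interpolated median. With an odd number of rows the two agree.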

Standard SQL:

#StandardSQL
SELECT DATE_TRUNC(DATE(pickup_datetime), MONTH) month, APPROX_QUANTILES(tip_amount,1000)[OFFSET(500)] median
FROM `nyc-tlc.green.trips_2015`
WHERE tip_amount > 0
GROUP BY 1
ORDER BY 1
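The index arithmetic in these queries is easy to get wrong, so here is a small Python sketch of it. It assumes APPROX_QUANTILES(x, 1000) returns 1001 boundaries (minimum, 999 cut points, maximum) read with a 0-based OFFSET, and that legacy QUANTILES(x, 101) returns 101 boundaries read with a 1-based NTH. The helper `quantile_boundaries` is hypothetical and computes exact boundaries, whereas BigQuery returns approximations.

```python
def quantile_boundaries(values, buckets):
    """Return buckets + 1 boundaries: minimum, interior cut points, maximum."""
    ordered = sorted(values)
    last = len(ordered) - 1
    # Boundary k sits k / buckets of the way through the sorted data;
    # rounding to the nearest rank is an exact stand-in for the approximation.
    return [ordered[round(k * last / buckets)] for k in range(buckets + 1)]

values = list(range(1, 1002))  # 1, 2, ..., 1001, so the true median is 501

# Standard SQL style: 1000 buckets give 1001 boundaries, OFFSET is 0-based.
standard = quantile_boundaries(values, 1000)
print(len(standard), standard[500])

# Legacy SQL style: 101 boundaries (100 buckets), NTH is 1-based.
legacy = quantile_boundaries(values, 100)
print(len(legacy), legacy[51 - 1])
```

Both lookups land on the middle boundary, which is why OFFSET(500) of 1001 values and NTH(51, ...) of 101 values both stand for the median.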
Felipe Hoffa answered Sep 24 '22 22:09