I need to calculate the median value of a numeric sequence in Google BigQuery efficiently. Is this possible?
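To compute the median, we can use the PERCENTILE_CONT(order_value, 0.5) function. It is an analytic function, so it is evaluated over a window of rows defined by an OVER clause and returns the median of the order_value column on every row of that window. The 0.5 argument selects the 50th percentile: half of the values lie at or below this point and half at or above it. When the number of rows is even, PERCENTILE_CONT interpolates linearly between the two middle values.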
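As a minimal sketch of the call described above, the query below runs PERCENTILE_CONT over a small inline dataset. The `orders` CTE and its four values are made up for illustration; only the `order_value` column name comes from the text.

```sql
#standardSQL
-- Hypothetical inline data standing in for a real orders table.
WITH orders AS (
  SELECT order_value
  FROM UNNEST([10.0, 20.0, 30.0, 40.0]) AS order_value
)
-- PERCENTILE_CONT is an analytic function, so every row carries the same
-- median; DISTINCT collapses the repeated rows into one.
SELECT DISTINCT
  PERCENTILE_CONT(order_value, 0.5) OVER () AS median_order_value
FROM orders
```

With an even number of rows the function interpolates between the two middle values (20 and 30 here), so the result is 25.0.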
To get the median, use PERCENTILE_CONT with a percentile of 0.5. To compute the median separately for specific groups of rows, add an OVER (PARTITION BY ...) clause. Here I've partitioned by the OrderID column to find the median unit price for each order ID.
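The per-group variant could look roughly like this in BigQuery Standard SQL, where the value expression is passed as the first argument. The `order_details` CTE, the `UnitPrice` column name and the sample rows are assumptions for illustration; only OrderID is named in the text.

```sql
#standardSQL
-- Hypothetical inline data: two orders with a few unit prices each.
WITH order_details AS (
  SELECT 1 AS OrderID, 10.0 AS UnitPrice UNION ALL
  SELECT 1, 20.0 UNION ALL
  SELECT 1, 30.0 UNION ALL
  SELECT 2, 5.0 UNION ALL
  SELECT 2, 15.0
)
-- PARTITION BY restricts each median to the rows of a single order.
SELECT DISTINCT
  OrderID,
  PERCENTILE_CONT(UnitPrice, 0.5) OVER (PARTITION BY OrderID) AS median_unit_price
FROM order_details
ORDER BY OrderID
```

For this sample data, order 1 has a median of 20.0 and order 2 a median of 10.0 (interpolated between 5 and 15).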
To get percentiles, simply ask for 100 quantiles:
select
  percentiles[offset(10)] as p10,
  percentiles[offset(25)] as p25,
  percentiles[offset(50)] as p50,
  percentiles[offset(75)] as p75,
  percentiles[offset(90)] as p90,
from (
  select approx_quantiles(char_length(text), 100) percentiles
  from `bigquery-public-data.
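Because the query above is cut off at the table name, here is a self-contained sketch of the same idea that should run without any table. The `sample` CTE and the numbers 1 to 1000 are stand-ins, not the original data.

```sql
#standardSQL
-- Hypothetical inline data: the integers 1..1000.
WITH sample AS (
  SELECT x
  FROM UNNEST(GENERATE_ARRAY(1, 1000)) AS x
)
SELECT
  percentiles[OFFSET(10)] AS p10,
  percentiles[OFFSET(50)] AS p50,  -- approximate median
  percentiles[OFFSET(90)] AS p90
FROM (
  -- Asking for 100 quantiles returns an array of 101 boundaries
  -- (offset 0 is the minimum, offset 100 the maximum).
  SELECT APPROX_QUANTILES(x, 100) AS percentiles
  FROM sample
)
```

APPROX_QUANTILES is an approximate aggregate, so the boundaries are not guaranteed to be exact on large inputs. That approximation is what makes it cheaper than an exact window-function median.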
2018 update with more metrics:
BigQuery SQL: Average, geometric mean, remove outliers, median
For my own memory purposes, working queries with taxi data:
Approximate quantiles (legacy SQL):
SELECT MONTH(pickup_datetime) month, NTH(51, QUANTILES(tip_amount,101)) median
FROM [nyc-tlc:green.trips_2015]
WHERE tip_amount > 0
GROUP BY 1
ORDER BY 1
Gives the same results as PERCENTILE_DISC:
SELECT month, FIRST(median) median
FROM (
SELECT MONTH(pickup_datetime) month, tip_amount, PERCENTILE_DISC(0.5) OVER(PARTITION BY month ORDER BY tip_amount) median
FROM [nyc-tlc:green.trips_2015]
WHERE tip_amount > 0
)
GROUP BY 1
ORDER BY 1
Standard SQL:
#StandardSQL
SELECT DATE_TRUNC(DATE(pickup_datetime), MONTH) month, APPROX_QUANTILES(tip_amount,1000)[OFFSET(500)] median
FROM `nyc-tlc.green.trips_2015`
WHERE tip_amount > 0
GROUP BY 1
ORDER BY 1
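For completeness, the legacy PERCENTILE_DISC query above might be written in Standard SQL along these lines. This is a sketch rather than a tested port: it assumes the same `nyc-tlc.green.trips_2015` table is still accessible and that pickup_datetime is a TIMESTAMP.

```sql
#standardSQL
-- In Standard SQL, PERCENTILE_DISC takes the value expression as its
-- first argument and needs no ORDER BY inside the window.
SELECT DISTINCT
  month,
  PERCENTILE_DISC(tip_amount, 0.5) OVER (PARTITION BY month) AS median
FROM (
  SELECT DATE_TRUNC(DATE(pickup_datetime), MONTH) AS month, tip_amount
  FROM `nyc-tlc.green.trips_2015`
  WHERE tip_amount > 0
)
ORDER BY month
```

Unlike APPROX_QUANTILES, this computes an exact value over each partition, which can be slower on large tables.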