In my CENSUS table, I'd like to group by State, and for each State get the median county population and the number of counties.
In psql, redshift, and snowflake, I can do this:
psql=> SELECT state, count(county), PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY "population2000") AS median FROM CENSUS GROUP BY state;
state | count | median
----------------------+-------+----------
Alabama | 67 | 36583
Alaska | 24 | 7296.5
Arizona | 15 | 116320
Arkansas | 75 | 20229
...
I'm trying to find a nice way to do this in standard BigQuery. I've noticed that there's undocumented percentile_cont
analytical function available, but I have to do some major hacks to get it to do what I want.
I'd like to be able to do the same sort thing with what I've gathered are the correct arguments:
SELECT
state,
COUNT(county),
PERCENTILE_CONT(population2000,
0.5) OVER () AS `medPop`
FROM
CENSUS
GROUP BY
state;
but this query yields the error
SELECT list expression references column population2000 which is neither grouped nor aggregated at
I can get the answer I want, but I'd be very disappointed if this is the recommended way to do what I want to do:
SELECT
MAX(nCounties) AS nCounties,
state,
MAX(medPop) AS medPop
FROM (
SELECT
nCounties,
T1.state,
(PERCENTILE_CONT(population2000,
0.5) OVER (PARTITION BY T1.state)) AS `medPop`
FROM
census T1
LEFT OUTER JOIN (
SELECT
COUNT(county) AS `nCounties`,
state
FROM
census
GROUP BY
state) T2
ON
T1.state = T2.state) T3
GROUP BY
state
Is there a better way to do what I want to do? Also, is the PERCENTILE_CONT
function ever going to be documented?
Thanks for reading!
To get percentiles, simply ask for 100 quantiles. select percentiles[offset(10)] as p10, percentiles[offset(25)] as p25, percentiles[offset(50)] as p50, percentiles[offset(75)] as p75, percentiles[offset(90)] as p90, from ( select approx_quantiles(char_length(text), 100) percentiles from `bigquery-public-data.
PERCENTILE_CONT is an inverse distribution function that assumes a continuous distribution model. It takes a percentile value and a sort specification, and returns an interpolated value that would fall into that percentile value with respect to the sort specification.
Thanks for your interest. PERCENTILE_CONT is under development, and we will publish the documentation once it is GA. We will support it as analytic function first, and we plan to support it as aggregate function (allowing GROUP BY) later. Between these 2 releases, a simpler workaround would be
SELECT
state,
ANY_VALUE(nCounties) AS nCounties,
ANY_VALUE(medPop) AS medPop
FROM (
SELECT
state,
COUNT(county) OVER (PARTITION BY state) AS nCounties,
PERCENTILE_CONT(population2000,
0.5) OVER (PARTITION BY state) AS medPop
FROM
CENSUS)
GROUP BY
state
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With