I have a table with two columns: id and score. I'd like to create a third column that equals the quantile that an individual's score falls in. I'd like to do this in BigQuery's standard SQL.
Here's my_table:
+----+--------+
| id | score |
+----+--------+
| 1 | 2 |
| 2 | 13 |
| 3 | -2 |
| 4 | 7 |
+----+--------+
Afterwards I'd like to have the following table (the example shows quartiles, but I'm equally interested in quintiles and deciles):
+----+--------+----------+
| id | score | quartile |
+----+--------+----------+
| 1 | 2 | 2 |
| 2 | 13 | 4 |
| 3 | -2 | 1 |
| 4 | 7 | 3 |
+----+--------+----------+
It would be excellent if this worked on 100 million rows. I've looked around and found a couple of solutions that seem to use legacy SQL, and the solutions using RANK() functions don't seem to work for really large datasets. Thanks!
To get percentiles, simply ask for 100 quantiles.
select percentiles[offset(10)] as p10,
  percentiles[offset(25)] as p25,
  percentiles[offset(50)] as p50,
  percentiles[offset(75)] as p75,
  percentiles[offset(90)] as p90,
from (
  select approx_quantiles(char_length(text), 100) percentiles
  from `bigquery-public-data.
The ROW_NUMBER() function in BigQuery. ROW_NUMBER is a numbering function, and numbering functions are a subset of the analytic functions in BigQuery. An analytic function needs an OVER clause to define a window of rows within the query result set. Within each window, ROW_NUMBER assigns a unique sequential number to every row.
APPROX_QUANTILES: returns the approximate boundaries for a group of expression values, where number represents the number of quantiles to create. This function returns an array of number + 1 elements, where the first element is the approximate minimum and the last element is the approximate maximum.
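As a rough sketch of how those boundaries could be turned into a per-row bucket, the query below counts how many interior boundaries lie strictly below each score. The inline my_table CTE just copies the sample data from the question, and the names bounds, q, boundary and pos are made up for illustration. Because the boundaries are approximate, rows near a boundary may land in a neighbouring bucket compared with an exact ranking, and a score equal to a boundary goes to the lower bucket here.

```sql
-- Sample data from the question, inlined so the query is self-contained.
WITH my_table AS (
  SELECT 1 AS id, 2 AS score UNION ALL
  SELECT 2, 13 UNION ALL
  SELECT 3, -2 UNION ALL
  SELECT 4, 7
),
-- One row holding an array of 5 approximate boundaries: [min, q1, q2, q3, max].
bounds AS (
  SELECT APPROX_QUANTILES(score, 4) AS q
  FROM my_table
)
SELECT
  t.id,
  t.score,
  -- Bucket = 1 + number of interior boundaries (offsets 1..3) strictly below the score.
  1 + (
    SELECT COUNT(*)
    FROM UNNEST(b.q) AS boundary WITH OFFSET AS pos
    WHERE pos BETWEEN 1 AND 3
      AND boundary < t.score
  ) AS quartile
FROM my_table AS t
CROSS JOIN bounds AS b
ORDER BY t.id;
```

For quintiles or deciles, the same shape should work with 5 or 10 passed to APPROX_QUANTILES and the offset range widened to 1..4 or 1..9. The appeal of this approach for very large tables is that it needs one aggregation and a join against a single row rather than a window ordered over every row.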
If I understand correctly, you can use ntile(). For instance, if you wanted a value from 1-4, you can do:
select t.*, ntile(4) over (order by score) as tile
from t;
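Since the question also asks about quintiles and deciles, here is a small sketch that computes all three in one pass with a named window. The inline my_table CTE copies the question's sample data, and the window name w and the output column names are illustrative choices.

```sql
-- Sample data from the question, inlined so the query is self-contained.
WITH my_table AS (
  SELECT 1 AS id, 2 AS score UNION ALL
  SELECT 2, 13 UNION ALL
  SELECT 3, -2 UNION ALL
  SELECT 4, 7
)
SELECT
  id,
  score,
  NTILE(4)  OVER w AS quartile,  -- buckets 1..4
  NTILE(5)  OVER w AS quintile,  -- buckets 1..5
  NTILE(10) OVER w AS decile     -- buckets 1..10
FROM my_table
WINDOW w AS (ORDER BY score)
ORDER BY id;
```

Note that this still orders the whole table in a single window, so it illustrates the function rather than addressing the 100-million-row concern.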
If you want to enumerate the values, then use rank() or dense_rank():
select t.*, rank() over (order by score) as tile
from t;
I see, your problem is getting the code to work, because BigQuery tends to run out of resources without a partition by. One method is to break up the score into different groups. I think this logic does what you want:
select *,
       ( -- rows in all earlier groups: cumulative count up to this group minus this group's own count
         (count(*) over (order by cast(score / 1000 as int64)) -
          count(*) over (partition by cast(score / 1000 as int64))
         ) +
         -- rank by score within the current group
         rank() over (partition by cast(score / 1000 as int64) order by score)
       ) as therank
       -- intended to match: rank() over (order by score) as therank
from t;
This breaks the scores into groups that are 1000 wide (perhaps that divisor is too large for integer scores; adjust it to the range of your data), and then reconstructs the overall ranking from the per-group ranks.
If your score has relatively low cardinality, then a join with aggregation works:
select t.*, (running_cnt - cnt + 1) as therank
from t join
(select score, count(*) as cnt, sum(count(*)) over (order by score) as running_cnt
from t
group by score
) s
on t.score = s.score;
Once you have the rank() (or row_number()), you can easily calculate the tiles yourself (hint: division).
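To make the division hint concrete, here is one possible sketch built on the join-with-aggregation query. It assumes equal-sized buckets are wanted and uses integer division on (rank - 1); the inline my_table CTE copies the question's sample data, and the total column is an addition for illustration. The buckets may differ slightly from what ntile() would assign when the row count is not a multiple of the number of tiles, and tied scores always share a bucket.

```sql
-- Sample data from the question, inlined so the query is self-contained.
WITH my_table AS (
  SELECT 1 AS id, 2 AS score UNION ALL
  SELECT 2, 13 UNION ALL
  SELECT 3, -2 UNION ALL
  SELECT 4, 7
),
-- One row per distinct score, with its count, the running count, and the grand total.
s AS (
  SELECT
    score,
    COUNT(*) AS cnt,
    SUM(COUNT(*)) OVER (ORDER BY score) AS running_cnt,
    SUM(COUNT(*)) OVER () AS total
  FROM my_table
  GROUP BY score
)
SELECT
  t.id,
  t.score,
  -- rank = running_cnt - cnt + 1, so (rank - 1) = running_cnt - cnt.
  -- Scale (rank - 1) to the range 0..3 with integer division, then shift to 1..4.
  DIV((s.running_cnt - s.cnt) * 4, s.total) + 1 AS quartile
FROM my_table AS t
JOIN s
  ON t.score = s.score
ORDER BY t.id;
```

On the four sample rows this gives quartiles 2, 4, 1 and 3 for ids 1 to 4, matching the table in the question. Replacing the 4 with 5 or 10 should give quintiles or deciles.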
Your expected output suggests rank():
SELECT *, RANK() OVER (ORDER BY score) as quantile
FROM my_table t
ORDER BY id;