Create column for the quantile number of a value in BigQuery

I have a table with two columns: id and score. I'd like to create a third column that equals the quantile that an individual's score falls in. I'd like to do this in BigQuery's standardSQL.

Here's my_table:

+----+--------+
| id | score  |
+----+--------+
|  1 |      2 |
|  2 |     13 |
|  3 |     -2 |
|  4 |      7 |
+----+--------+

and afterwards I'd like to have the following table (example shown with quartiles, but I'd be interested in quartiles/quintiles/deciles)

+----+--------+----------+
| id | score  | quaRtile |
+----+--------+----------+
|  1 |      2 |        2 |
|  2 |     13 |        4 |
|  3 |     -2 |        1 |
|  4 |      7 |        3 |
+----+--------+----------+

It would be excellent if this were to work on 100 million rows. I've looked around to see a couple solutions that seem to use legacy sql, and the solutions using RANK() functions don't seem to work for really large datasets. Thanks!

How do you find the percentile of a BigQuery?

To get percentiles, simply ask for 100 quantiles. select percentiles[offset(10)] as p10, percentiles[offset(25)] as p25, percentiles[offset(50)] as p50, percentiles[offset(75)] as p75, percentiles[offset(90)] as p90, from ( select approx_quantiles(char_length(text), 100) percentiles from `bigquery-public-data.

What is Rownum in BigQuery?

ROW_NUMBER() function in BigQuery. Row_number is a Numbering function which is a subset of Analytic function in BigQuery. In the analytic function, the OVER clause should be included to define a window of rows within a query result set. For each selected window of rows, Row_number function assigns a unique number.

What is Approx_quantiles?

APPROX_QUANTILES Description Returns the approximate boundaries for a group of expression values, where number represents the number of quantiles to create. This function returns an array of number + 1 elements, where the first element is the approximate minimum and the last element is the approximate maximum.

If I understand correctly, you can use ntile(). For instance, if you wanted a value from 1-4, you can do:

select t.*, ntile(4) over (order by score) as tile
from t;

If you want to enumerate the values, then use rank() or dense_rank():

select t.*, rank() over (order by score) as tile
from t;

I see, your problem is getting the code to work, because BigQuery tends to run out of resources without a partition by. One method is to break up the score into different groups. I think this logic does what you want:

select *, 
       ( (count(*) over (partition by cast(score / 1000 as int64) order by cast(score / 1000 as int64)) -
          count(*) over (partition by cast(score / 1000 as int64))
         ) +
         rank() over (partition by cast(score / 1000 as int64) order by regi_id)
      ) as therank,
      -- rank() over (order by score) as therank
from t;

This breaks the score into 1000 groups (perhaps that is too many for an integer). And then reconstructs the ranking.

If your score has relatively low cardinality, then join with aggregation works:

select t.*, (running_cnt - cnt + 1) as therank
from t join
     (select score, count(*) as cnt, sum(count(*)) over (order by score) as running_cnt
      from t
      group by score
     ) s
     on t.score = s.score;

Once you have the rank() (or row_number()) you can easily calculate the tiles yourself (hint: division).

Output suggest me rank() :

SELECT *, RANK() OVER (ORDER BY score) as quantile 
FROM table t
ORDER BY id;

Create column for the quantile number of a value in BigQuery

Tags:

sql

google-bigquery

bigquery-standard-sql

giov

People also ask

2 Answers

Gordon Linoff

Yogesh Sharma

Recent Activity

Donate For Us

Create column for the quantile number of a value in BigQuery

Tags:

sql

google-bigquery

bigquery-standard-sql

giov

People also ask

2 Answers

Gordon Linoff

Yogesh Sharma

Related questions

Recent Activity

Donate For Us