Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Create column for the quantile number of a value in BigQuery

I have a table with two columns: id and score. I'd like to create a third column that equals the quantile that an individual's score falls in. I'd like to do this in BigQuery's standardSQL.

Here's my_table:

+----+--------+
| id | score  |
+----+--------+
|  1 |      2 |
|  2 |     13 |
|  3 |     -2 |
|  4 |      7 |
+----+--------+

and afterwards I'd like to have the following table (example shown with quartiles, but I'd be interested in quartiles/quintiles/deciles)

+----+--------+----------+
| id | score  | quaRtile |
+----+--------+----------+
|  1 |      2 |        2 |
|  2 |     13 |        4 |
|  3 |     -2 |        1 |
|  4 |      7 |        3 |
+----+--------+----------+

It would be excellent if this were to work on 100 million rows. I've looked around to see a couple solutions that seem to use legacy sql, and the solutions using RANK() functions don't seem to work for really large datasets. Thanks!

like image 844
giov Avatar asked Aug 30 '18 19:08

giov


People also ask

How do you find the percentile of a BigQuery?

To get percentiles, simply ask for 100 quantiles. select percentiles[offset(10)] as p10, percentiles[offset(25)] as p25, percentiles[offset(50)] as p50, percentiles[offset(75)] as p75, percentiles[offset(90)] as p90, from ( select approx_quantiles(char_length(text), 100) percentiles from `bigquery-public-data.

What is Rownum in BigQuery?

ROW_NUMBER() function in BigQuery. Row_number is a Numbering function which is a subset of Analytic function in BigQuery. In the analytic function, the OVER clause should be included to define a window of rows within a query result set. For each selected window of rows, Row_number function assigns a unique number.

What is Approx_quantiles?

APPROX_QUANTILES Description Returns the approximate boundaries for a group of expression values, where number represents the number of quantiles to create. This function returns an array of number + 1 elements, where the first element is the approximate minimum and the last element is the approximate maximum.


2 Answers

If I understand correctly, you can use ntile(). For instance, if you wanted a value from 1-4, you can do:

select t.*, ntile(4) over (order by score) as tile
from t;

If you want to enumerate the values, then use rank() or dense_rank():

select t.*, rank() over (order by score) as tile
from t;

I see, your problem is getting the code to work, because BigQuery tends to run out of resources without a partition by. One method is to break up the score into different groups. I think this logic does what you want:

select *, 
       ( (count(*) over (partition by cast(score / 1000 as int64) order by cast(score / 1000 as int64)) -
          count(*) over (partition by cast(score / 1000 as int64))
         ) +
         rank() over (partition by cast(score / 1000 as int64) order by regi_id)
      ) as therank,
      -- rank() over (order by score) as therank
from t;

This breaks the score into 1000 groups (perhaps that is too many for an integer). And then reconstructs the ranking.

If your score has relatively low cardinality, then join with aggregation works:

select t.*, (running_cnt - cnt + 1) as therank
from t join
     (select score, count(*) as cnt, sum(count(*)) over (order by score) as running_cnt
      from t
      group by score
     ) s
     on t.score = s.score;

Once you have the rank() (or row_number()) you can easily calculate the tiles yourself (hint: division).

like image 181
Gordon Linoff Avatar answered Oct 18 '22 23:10

Gordon Linoff


Output suggest me rank() :

SELECT *, RANK() OVER (ORDER BY score) as quantile 
FROM table t
ORDER BY id;
like image 1
Yogesh Sharma Avatar answered Oct 18 '22 23:10

Yogesh Sharma