Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Google BigQuery APPROX_QUANTILES and getting true quartiles

According to the docs:

Returns the approximate boundaries for a group of expression values, where number represents the number of quantiles to create. This function returns an array of number + 1 elements, where the first element is the approximate minimum and the last element is the approximate maximum.

It sounds like if I want true quartiles, I need to use APPROX_QUANTILES(values, 4) which will return [minvalue, 1st quartile, 2nd quartile, 3rd quartile, maxvalue]

As according to https://en.wikipedia.org/wiki/Quartile, quartile sets contain 3 data points - none of which is the min/max values of the data.

Is my assumption correct? Is APPROX_QUANTILES(values, 4) going to return the true quartiles?

like image 313
Tyler_1 Avatar asked Jan 18 '18 17:01

Tyler_1


1 Answers

As a baseline, this is the output without any modification, using an input of numbers between 1 and 100:

SELECT APPROX_QUANTILES(x, 4) AS output
FROM UNNEST(GENERATE_ARRAY(1, 100)) AS x;
+----------------------------+
|           output           |
+----------------------------+
| ["1","25","50","75","100"] |
+----------------------------+

The output includes both the minimum (1) and the maximum (100). If you just want the quartiles, you need to strip them from the array. For the sake of readability/composability, it's best to do this using a temporary SQL UDF. Here I'm using INT64 for the element type, but you could have a different element type, alternatively:

CREATE TEMP FUNCTION StripFirstLast(arr ARRAY<INT64>) AS (
  ARRAY(SELECT x FROM UNNEST(arr) AS x WITH OFFSET
  WHERE OFFSET BETWEEN 1 AND ARRAY_LENGTH(arr) - 2)
);

SELECT
  APPROX_QUANTILES(x, 4) AS output,
  StripFirstLast(APPROX_QUANTILES(x, 4)) AS quartiles
FROM UNNEST(GENERATE_ARRAY(1, 100)) AS x;
+----------------------------+------------------+
|           output           |    quartiles     |
+----------------------------+------------------+
| ["1","25","50","75","100"] | ["25","50","75"] |
+----------------------------+------------------+

You can see that the quartiles array contains only the desired values.

like image 99
Elliott Brossard Avatar answered Oct 19 '22 01:10

Elliott Brossard