Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

BigQuery: GROUP BY clause for QUANTILES

Based on the bigquery query reference, currently Quantiles do not allow any kind of grouping by another column. I am mainly interested in getting medians grouped by a certain column. The only work around I see right now is to generate a quantile query per distinct group member where the group member is a condition in the where clause.

For example I use the below query for every distinct row in column-y if I want to get the desired result.

SELECT QUANTILE( <column-x>, 1001)
FROM <table>
WHERE 
    <column-y> == <each distinct row in column-y>
  1. Does the big query team plan on having some functionality to allow grouping on quantiles in the future?
  2. Is there a better way to get what I am trying to get here?

Thanks

like image 971
msrivas Avatar asked Sep 19 '12 16:09

msrivas


People also ask

How do you find the 90th percentile in BigQuery?

To get percentiles, simply ask for 100 quantiles. select percentiles[offset(10)] as p10, percentiles[offset(25)] as p25, percentiles[offset(50)] as p50, percentiles[offset(75)] as p75, percentiles[offset(90)] as p90, from ( select approx_quantiles(char_length(text), 100) percentiles from `bigquery-public-data.

What is the difference between quantile and percentile?

Percentiles are given as percent values, values such as 95%, 40%, or 27%. Quantiles are given as decimal values, values such as 0.95, 0.4, and 0.27. The 0.95 quantile point is exactly the same as the 95th percentile point.

What is Approx_quantiles?

APPROX_QUANTILES( [DISTINCT] expression, number [{IGNORE|RESPECT} NULLS] ) APPROX_QUANTILE function will return the approximate boundaries or the values for a group of expression values, in which number represents the number of quantiles you want to create.


2 Answers

With the recently announced percentile_cont() window function you can get medians.

Look at the example in the announcement blog post:

http://googlecloudplatform.blogspot.com/2013/06/google-bigquery-bigger-faster-smarter-analytics-functions.html

SELECT MAX(median) AS median, room FROM (
  SELECT percentile_cont(0.5) OVER (PARTITION BY room ORDER BY data) AS median, room
  FROM [io_sensor_data.moscone_io13]
  WHERE sensortype='temperature'
)
GROUP BY room
like image 94
Felipe Hoffa Avatar answered Oct 15 '22 08:10

Felipe Hoffa


While there are efficient algorithms to compute quantiles they are somewhat memory intensive - trying to do multiple quantile calculations in a single query gets expensive.

  1. There are plans to improve QUANTILES, but I don't know what the timeline is.
  2. Do you need median? Can you filter outliers and do an average of the remainder?
like image 26
Jordan Tigani Avatar answered Oct 15 '22 09:10

Jordan Tigani