Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to get the most frequent value in Google's Bigquery

Postgres has an easy function to achieve this, just using the mode() function we can find the most frequent value. Is there something equivalent within Google's Bigquery?

How could be written a query like this in Bigquery?

select count(*),
       avg(vehicles)                                         as mean,
       percentile_cont(0.5) within group (order by vehicles) as median,
       mode() within group (order by vehicles)               as most_frequent_value
FROM "driver"
WHERE vehicles is not null;
like image 941
Ander Avatar asked Dec 17 '22 17:12

Ander


2 Answers

Below is for BigQuery Standard SQL

Option 1

#standardSQL
SELECT * FROM (
  SELECT COUNT(*) AS cnt,
    AVG(vehicles) AS mean,
    APPROX_TOP_COUNT(vehicles, 1)[OFFSET(0)].value AS most_frequent_value
  FROM `project.dataset.table`
  WHERE vehicles IS NOT NULL
) CROSS JOIN (
  SELECT PERCENTILE_CONT(vehicles, 0.5) OVER() AS median
  FROM `project.dataset.table`
  WHERE vehicles IS NOT NULL
  LIMIT 1
)

Option 2

#standardSQL
SELECT * FROM (
  SELECT COUNT(*) cnt,
    AVG(vehicles) AS mean
  FROM `project.dataset.table`
  WHERE vehicles IS NOT NULL
) CROSS JOIN (
  SELECT PERCENTILE_CONT(vehicles, 0.5) OVER() AS median
  FROM `project.dataset.table`
  WHERE vehicles IS NOT NULL
  LIMIT 1
) CROSS JOIN (
  SELECT vehicles AS most_frequent_value
  FROM `project.dataset.table`
  WHERE vehicles IS NOT NULL
  GROUP BY vehicles
  ORDER BY COUNT(1) DESC
  LIMIT 1
)  

Option 3

#standardSQL
CREATE TEMP FUNCTION median(arr ANY TYPE) AS ((
  SELECT PERCENTILE_CONT(x, 0.5) OVER() 
  FROM UNNEST(arr) x LIMIT 1 
));
CREATE TEMP FUNCTION most_frequent_value(arr ANY TYPE) AS ((
  SELECT x 
  FROM UNNEST(arr) x
  GROUP BY x
  ORDER BY COUNT(1) DESC
  LIMIT 1  
));
SELECT COUNT(*) cnt,
  AVG(vehicles) AS mean,
  median(ARRAY_AGG(vehicles)) AS median,
  most_frequent_value(ARRAY_AGG(vehicles)) AS most_frequent_value
FROM `project.dataset.table`
WHERE vehicles IS NOT NULL   

and so on ...

like image 183
Mikhail Berlyant Avatar answered Dec 20 '22 05:12

Mikhail Berlyant


You can use APPROX_TOP_COUNT to get top values, e.g.:

SELECT APPROX_TOP_COUNT(vehicles, 5) AS top_five_vehicles
FROM dataset.driver

If you just want the top value, you can select it from the array:

SELECT APPROX_TOP_COUNT(vehicles, 1)[OFFSET(0)] AS most_frequent_value
FROM dataset.driver
like image 26
Elliott Brossard Avatar answered Dec 20 '22 07:12

Elliott Brossard