How to get the First N values of multiple columns?

Question

Let's say I have the following table:

WITH tbl AS (
    SELECT 1 AS id, "Phone" AS product, 105 AS cost UNION ALL
    SELECT 2 AS id, "Camera" AS product, 82 AS cost UNION ALL
    SELECT 3 AS id, "Cup" AS product, 103 AS cost
) SELECT * FROM tbl

How would I get N distinct values for each column? For example, similar to the "Value distribution" (bottom right) from Power BI shown here:

enter image description here

I don't care about the counts of each value, but just to pull a representative sample of values (say, up to ten values).

For example, to display a sample of values without having to run a query for each column? That is, I'd like to grab them all in one-go. So far I have something like:

WITH tbl AS (
    SELECT 1 AS id, 'Phone' AS product, 105 AS cost UNION ALL
    SELECT 2 AS id, 'Camera' AS product, 82 AS cost UNION ALL
    SELECT 3 AS id, 'Cup' AS product, 103 AS cost
) 
SELECT 
    ARRAY_AGG(DISTINCT id LIMIT 2),
    ARRAY_AGG(DISTINCT product LIMIT 2),
    ARRAY_AGG(DISTINCT cost LIMIT 2)
FROM tbl

This works, but it seems very inefficient (I believe the same as running a query for each column). What is a better way to do this?

Or, to generalize what I think to be a poor approach but applicable outside BQ:

WITH tbl AS (
    SELECT 1 AS id, 'Phone' AS product, 105 AS cost UNION ALL
    SELECT 2 AS id, 'Camera' AS product, 82 AS cost UNION ALL
    SELECT 3 AS id, 'Cup' AS product, 103 AS cost
)  
select 'id' as field, array(select distinct cast(id as string) from tbl limit 2) as values union all
select 'product', array(select distinct cast(product as string) from tbl limit 2) union all
select 'cost', array(select distinct cast(cost as string) from tbl limit 2);

And a further improvement based on Mikhail's answer:

WITH tbl AS (
    SELECT 1 AS id, "Phone" AS product, 105 AS cost, true as is_big, date '2014-01-01' as d UNION ALL
    SELECT 2 AS id, "Camera" AS product, 82 AS cost, false as is_big, date '2017-01-01' as d UNION ALL
    SELECT 3 AS id, "Cup" AS product, 103 AS cost, false as is_big, date '2015-01-01' as d union all
    SELECT 7 AS id, "Several" AS product, 103 AS cost, true as is_big, date '2016-01-01' as d
) 
SELECT 
  name, 
  IF(
    array_length(quantiles) is not null, 
    ARRAY(SELECT CAST(tmp AS STRING) FROM UNNEST(quantiles) tmp), 
    ARRAY(SELECT value FROM t.top_values)
  ) values
FROM ML.DESCRIBE_DATA(
  (SELECT * FROM tbl), STRUCT(3 AS num_quantiles, 4 AS top_k)
) t;

Mikhail Berlyant · Accepted Answer

Consider below as an option (BigQuery) - fast and cheap and have no dependency on number of columns!!

WITH tbl AS (
    SELECT 1 AS id, "Phone" AS product, 105 AS cost UNION ALL
    SELECT 2 AS id, "Camera" AS product, 82 AS cost UNION ALL
    SELECT 3 AS id, "Cup" AS product, 103 AS cost
) 
SELECT name, quantiles AS num_values, ARRAY(SELECT value FROM t.top_values) AS string_values
FROM ML.DESCRIBE_DATA(
  (SELECT * FROM tbl),
  STRUCT(1 AS num_quantiles, 2 AS top_k)
) t;

enter image description here

How to get the First N values of multiple columns?

Tags:

sql

postgresql

google-bigquery

David542

1 Answers

Mikhail Berlyant

Recent Activity

Donate For Us

How to get the First N values of multiple columns?

Tags:

sql

postgresql

google-bigquery

David542

1 Answers

Mikhail Berlyant

Related questions

Recent Activity

Donate For Us