Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to count frequency of elements in a bigquery array field

I have a table that looks like this:

enter image description here

I am looking for a table that gives a frequency count of the elements in the fields l_0, l_1, l_2, l_3.

For example the output should look like this:

| author_id  | year | l_o.name         | l_0.count| l1.name    | l1.count | l2.name             | l2.count| l3.name            | l3.count|
| 2164089123 | 1987 | biology          | 3        | botany     | 3        |                     |         |                    |         |
| 2595831531 | 1987 | computer science | 2        | simulation | 2        | computer simulation | 2       | mathematical model | 2       |

Edit:

In some cases the array field might have more than one type of element. For example l_0 could be ['biology', 'biology', 'geometry', 'geometry']. In that case the output for fields l_0, l_1, l_2, l_3 would be a nested repeated field with all the elements in l_0.name and all the corresponding counts in the l_0.count.

like image 598
Syed Arefinul Haque Avatar asked Jan 23 '18 21:01

Syed Arefinul Haque


People also ask

What is the use of count BigQuery?

The use of the COUNT function in Bigquery is to return a single value from the number of rows in the input. The DISTINCT clause with COUNT is used only to eliminate any duplicate row in the table.

What does UNNEST do in BigQuery?

To convert an ARRAY into a set of rows, also known as "flattening," use the UNNEST operator. UNNEST takes an ARRAY and returns a table with a single row for each element in the ARRAY . Because UNNEST destroys the order of the ARRAY elements, you may wish to restore order to the table.

How to convert array to STRING in BigQuery?

BigQuery ARRAY_TO_STRING(): Converting an Array to a String You can use the ARRAY_TO_STRING() function to convert a single ARRAY<STRING> to a single STRING value.


1 Answers

This should work, assuming you want to count on a per-array basis:

SELECT
  author_id,
  year,
  (SELECT AS STRUCT ANY_VALUE(l_0) AS name, COUNT(*) AS count
   FROM UNNEST(l_0) AS l_0) AS l_0,
  (SELECT AS STRUCT ANY_VALUE(l_1) AS name, COUNT(*) AS count
   FROM UNNEST(l_1) AS l_1) AS l_1,
  (SELECT AS STRUCT ANY_VALUE(l_2) AS name, COUNT(*) AS count
   FROM UNNEST(l_2) AS l_2) AS l_2,
  (SELECT AS STRUCT ANY_VALUE(l_3) AS name, COUNT(*) AS count
   FROM UNNEST(l_3) AS l_3) AS l_3
FROM YourTable;

To avoid so much repetition, you can make use of a SQL UDF:

CREATE TEMP FUNCTION GetNameAndCount(elements ARRAY<STRING>) AS (
  (SELECT AS STRUCT ANY_VALUE(elem) AS name, COUNT(*) AS count
   FROM UNNEST(elements) AS elem)
);

SELECT
  author_id,
  year,
  GetNameAndCount(l_0) AS l_0,
  GetNameAndCount(l_1) AS l_1,
  GetNameAndCount(l_2) AS l_2,
  GetNameAndCount(l_3) AS l_3
FROM YourTable;

If you potentially need to account for multiple different names within an array, you can have the UDF return an array of them with associated counts instead:

CREATE TEMP FUNCTION GetNamesAndCounts(elements ARRAY<STRING>) AS (
  ARRAY(
    SELECT AS STRUCT elem AS name, COUNT(*) AS count
    FROM UNNEST(elements) AS elem
    GROUP BY elem
    ORDER BY count
  )
);

SELECT
  author_id,
  year,
  GetNamesAndCounts(l_0) AS l_0,
  GetNamesAndCounts(l_1) AS l_1,
  GetNamesAndCounts(l_2) AS l_2,
  GetNamesAndCounts(l_3) AS l_3
FROM YourTable;

Note that if you want to perform counting across rows, however, you'll need to unnest the arrays and perform the GROUP BY at the outer level, but it doesn't look like this is your intention based on the question.

like image 58
Elliott Brossard Avatar answered Sep 21 '22 23:09

Elliott Brossard