Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Query BigQuery nested/repeated fields

I am new to BigQuery (and sql) and am trying to setup a table where each row contains a message column (string) and metadata column (repeated). I want to use the metadata column so that I can query for messages that match certain metadata parameters.

For example, my message string is:

Hi honey, I'm home!

and the metadata column is arbitrary key/value pairs that describe the message:

{"category": "personal", "message_type": "salutation"}

This would allow me to ask bigquery to:

return me all messages which have a category of personal (or message_type of personal, etc)

To reiterate, the metadata keys will potentially be different for every message row that is inserted (e.g. a new row may have a metadata key of favorite_color instead of category).

Here is the schema that I came up with in order to support the metadata flexibility that I desire:

[
{"name": "message", "type": "string", "mode": "required"}
{"name": "metadata", "type": "record", "mode": "repeated", "fields": [ {"name": "key", 
                                                                        "type": "string", 
                                                                        "mode": "required"},
                                                                       {"name": "value", 
                                                                        "type": "string", 
                                                                        "mode": "required"}]}
]

The schema expects that every metadata entry/row contain a column called key and column called value. It seems to work fine.

Interestingly, Bigquery represents my message row as two rows (which I guess is called "flattening"?):

message                 | metadata_key    |    metadata_value
------------------------------------------------------------
Hi honey, I'm home!       category             personal
Hi honey, I'm home!       message_type         salutation

I figured out how to query against a single metadata value such as category=personal:

SELECT * FROM [table.test] 
WHERE
metadata.key="category" AND metadata.value="personal"

I get a single row back:

message                 | metadata_key    |    metadata_value
------------------------------------------------------------
Hi honey, I'm home!       category             personal

This is great! However, I don't know how to construct a more complex query that matches against multiple metadata parameters, such as:

Fetch me all messages which have a category of personal and message_type of salutation

This following query doesn't return anything:

SELECT * FROM [table.test] 
WHERE
metadata.key="category" AND metadata.value="personal" AND
metadata.key="message_type" AND metadata.value="salutation" 

I assume this is because bigquery is breaking down my single message row into two rows (one for each nested metadata row). This obviously means that my query will never be able to match a single row because I am trying to match against two different metadata rows (i.e. I'm asking for metadata.key to be equal to two different values at the same time, and metadata.value to be equal to different values at the same time). So I'm somewhat stumped as to how to construct my query appropriately.

Ideally I would like BigQuery to return a single row (the original example row that I started with), since it satisfies the two metadata requirements of my query.

I assume I need some combination of GROUP BY, FLATTEN, WITHIN, unflatten, etc. The combination of possibilities/structure are somewhat daunting for this sql/bigquery novice. Any help appreciated :)

like image 988
Lawrence Avatar asked Mar 29 '16 23:03

Lawrence


People also ask

How do you query a repeated field in BigQuery?

How to Query BigQuery Repeated Fields. To extract information from a repeated field in BigQuery, you must use a more exotic pattern. This is normally done using the UNNEST function, which converts an array of values in a table into rows. These can then be joined to the original table to be queried.

What is repeated fields in BigQuery?

A repeated field can be accessed as an ARRAY type in standard SQL. A RECORD column can have REPEATED mode, which is represented as an array of STRUCT types. Also, a field within a record can be repeated, which is represented as a STRUCT that contains an ARRAY . An array cannot contain another array directly.

Can we use loop in BigQuery?

This is just a loop over every day, week, etc on the same table, and then a COUNT on the distinct events happening within that period of time. But we can't do 'loops' in BigQuery.

How do you use the Unnest function in BigQuery?

To convert an ARRAY into a set of rows, also known as "flattening," use the UNNEST operator. UNNEST takes an ARRAY and returns a table with a single row for each element in the ARRAY . Because UNNEST destroys the order of the ARRAY elements, you may wish to restore order to the table.


1 Answers

Try below - just one of options
It sums how many "matches you have in each row and if all (in this cases - 2) it outputs it. Pretty simple

SELECT *, 
SUM((metadata.key="category" AND metadata.value="personal")
 OR (metadata.key="message_type" AND metadata.value="salutation")
) WITHIN RECORD AS check,
FROM [table.test] 
HAVING check = 2

Cleaner version of this would be as below

SELECT * 
FROM [table.test]
OMIT RECORD IF 
SUM((metadata.key="category" AND metadata.value="personal")
 OR (metadata.key="message_type" AND metadata.value="salutation")
) != 2    
like image 189
Mikhail Berlyant Avatar answered Sep 21 '22 13:09

Mikhail Berlyant