
Is self-join the way to go on BigQuery when fetching data from multiple repeated fields?

Consider this schema:

key: REQUIRED INTEGER
description: NULLABLE STRING
field: REPEATED RECORD {
    field.names: REQUIRED STRING
    field.value: NULLABLE FLOAT
}

Where: key is unique within the table, and field.names is actually a comma-separated list of quoted properties ("property1","property2","property3"...).

Sample dataset (don't pay attention to the actual values, they are only for demonstration of the structure):

{"key":1,"description":"Cool","field":[{"names":"\"Nice\",\"Wonderful\",\"Woohoo\"", "value":1.2},{"names":"\"Everything\",\"is\",\"Awesome\"", "value":20}]}
{"key":2,"description":"Stack","field":[{"names":"\"Overflow\",\"Exchange\",\"Nice\"", "value":2.0}]}
{"key":3,"description":"Iron","field":[{"names":"\"The\",\"Trooper\"", "value":666},{"names":"\"Aces\",\"High\",\"Awesome\"", "value":333}]}

What I need is a way to query for the values of multiple field.names at once. The output should be like this:

+-----+--------+-------+-------+-------+-------+
| key |  desc  | prop1 | prop2 | prop3 | prop4 |
+-----+--------+-------+-------+-------+-------+
| 1   | Desc 1 | 1.0   | 2.0   | 3.0   | 4.0   |
| 2   | Desc 2 | 4.0   | 3.0   | 2.0   | 1.0   |
| ... |        |       |       |       |       |
+-----+--------+-------+-------+-------+-------+

If the same key contains fields with the same queried name, only the first value should be considered.

And here is my query so far:

select all.key as key, all.description as desc, 
t1.col as prop1, t2.col as prop2, t3.col as prop3 //and so on...

from mydataset.mytable all

left join each 
(select key, field.value as col from 
mydataset.mytable
where lower(field.names) contains '"trooper"'
group each by key, col
) as t1 on all.key = t1.key

left join each 
(select key, field.value as col from 
mydataset.mytable
where lower(field.names) contains '"awesome"'
group each by key, col
) as t2 on all.key = t2.key

left join each 
(select key, field.value as col from 
mydataset.mytable
where lower(field.names) contains '"nice"'
group each by key, col
) as t3 on all.key = t3.key

//and so on...

The output of this query would be:

+-----+-------+-------+-------+-------+
| key | desc  | prop1 | prop2 | prop3 |
+-----+-------+-------+-------+-------+
|   1 | Cool  | null  | 20.0  | 1.2   |
|   2 | Stack | null  | null  | 2.0   |
|   3 | Iron  | 666.0 | 333.0 | null  |
+-----+-------+-------+-------+-------+

So my question is: is this the way to go? If my user wants, let's say, 200 properties from my table, should I just make 200 self-joins? Is it scalable, considering the table can grow to billions of rows? Is there another way to do the same thing in BigQuery?

Thanks.

Gilberto Torrezan asked Mar 19 '23 03:03



1 Answer

Generally speaking, a query with more than 50 joins can start to become problematic, particularly if you're joining large tables. Even with repeated fields, you want to try to scan your tables in one pass wherever possible.

It's useful to note that when you query a table with a repeated field, you are really querying a semi-flattened representation of that table. You can pretend that each repetition is its own row, and apply filters, expressions, and grouping accordingly.
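To make the "semi-flattened" idea concrete, here is a minimal Python sketch (not BigQuery code) that expands the sample records from the question into one row per repetition of the repeated field. The function name semi_flatten and the handling of a record with no repetitions (one row with nulls) are my own assumptions for illustration.

```python
# Sample records from the question (values stored as floats, matching the FLOAT column).
RECORDS = [
    {"key": 1, "description": "Cool", "field": [
        {"names": '"Nice","Wonderful","Woohoo"', "value": 1.2},
        {"names": '"Everything","is","Awesome"', "value": 20.0},
    ]},
    {"key": 2, "description": "Stack", "field": [
        {"names": '"Overflow","Exchange","Nice"', "value": 2.0},
    ]},
    {"key": 3, "description": "Iron", "field": [
        {"names": '"The","Trooper"', "value": 666.0},
        {"names": '"Aces","High","Awesome"', "value": 333.0},
    ]},
]


def semi_flatten(records):
    """Yield one flat row per repetition of `field`, repeating the parent columns.

    Assumption for this sketch: a record with no repetitions still yields
    one row, with the repeated columns set to None.
    """
    for rec in records:
        repetitions = rec["field"] or [{"names": None, "value": None}]
        for rep in repetitions:
            yield {
                "key": rec["key"],
                "description": rec["description"],
                "field.names": rep["names"],
                "field.value": rep["value"],
            }


rows = list(semi_flatten(RECORDS))
for row in rows:
    print(row)
```

Filters and expressions on field.names or field.value then behave as if they were applied to each of these rows in turn.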

In this case, I think you can probably get away with a single scan:

select
  key,
  description as desc,
  max(if(lower(field.names) contains "trooper", field.value, null))
      within record as prop1,
  max(if(lower(field.names) contains "awesome", field.value, null))
      within record as prop2,
  ...
from mydataset.mytable

Here, each "prop" column selects the value for the desired field name, or null if a given repetition does not match, and the "max" function then collapses those per-repetition results into one value per record. I'm assuming that each field name occurs only once per key, in which case the specific aggregation function doesn't matter much, since it exists only to collapse nulls. If a name can occur more than once per key, swap in an aggregation that matches the behavior you need (the question asks for the first value).
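The per-record aggregation can be emulated in plain Python to check it against the sample data. This is only a sketch of the semantics, not how BigQuery executes the query; the helper names pick, first, and pivot are hypothetical, and the record with key 4 is invented purely to show where "max" and "first" differ.

```python
# Sample records from the question.
RECORDS = [
    {"key": 1, "description": "Cool", "field": [
        {"names": '"Nice","Wonderful","Woohoo"', "value": 1.2},
        {"names": '"Everything","is","Awesome"', "value": 20.0},
    ]},
    {"key": 2, "description": "Stack", "field": [
        {"names": '"Overflow","Exchange","Nice"', "value": 2.0},
    ]},
    {"key": 3, "description": "Iron", "field": [
        {"names": '"The","Trooper"', "value": 666.0},
        {"names": '"Aces","High","Awesome"', "value": 333.0},
    ]},
]


def first(values):
    """Alternative aggregation: keep the first matching value in repetition order."""
    return values[0]


def pick(record, needle, agg=max):
    """Emulate agg(if(lower(field.names) contains needle, field.value, null))
    over the repetitions of a single record. Nulls are ignored, and the
    result is None when nothing matches."""
    values = [
        rep["value"]
        for rep in record["field"]
        if needle in rep["names"].lower() and rep["value"] is not None
    ]
    return agg(values) if values else None


def pivot(records, needles, agg=max):
    """One output tuple per record: key, description, then one column per needle."""
    return [
        (rec["key"], rec["description"], *(pick(rec, n, agg) for n in needles))
        for rec in records
    ]


result = pivot(RECORDS, ["trooper", "awesome", "nice"])
for row in result:
    print(row)

# Hypothetical record (not in the question) where the same name appears twice.
DUPLICATE = {"key": 4, "description": "Dup", "field": [
    {"names": '"Nice"', "value": 1.0},
    {"names": '"Nice"', "value": 5.0},
]}
print(pick(DUPLICATE, "nice", max), pick(DUPLICATE, "nice", first))
```

On the three sample records this reproduces the table from the question. The two aggregations only disagree on the invented duplicate record, which is the case the "first value" requirement is about.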

The "within record" syntax tells BigQuery to perform those aggregations only over the repeated fields within a record, and not across the entire table, thus eliminating the need for a "group by" clause at the end.
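Since each requested property adds just one more aggregate expression to the same scan, the query text for a long property list (say, the 200 from the question) can be generated rather than written by hand. Below is a rough Python sketch that only builds the query string. The function build_pivot_query, the name whitelist, and the choice to match the quoted form ('"trooper"', as in the question's own query) are assumptions of mine, and the table name is assumed to come from trusted code rather than user input.

```python
import re

# Assumption: property names are simple enough to whitelist, which keeps
# user-supplied text from breaking out of the string literal.
_SAFE_NAME = re.compile(r"^[a-z0-9_ ]+$")


def build_pivot_query(table, properties):
    """Build a single-scan legacy SQL query with one
    max(if(...)) within record column per requested property."""
    if not properties:
        raise ValueError("at least one property is required")

    exprs = []
    for i, prop in enumerate(properties, start=1):
        needle = prop.lower()
        if not _SAFE_NAME.match(needle):
            raise ValueError(f"unsupported property name: {prop!r}")
        # Match the property together with its surrounding double quotes,
        # the same way the question's query does.
        literal = "'\"" + needle + "\"'"
        exprs.append(
            f"  max(if(lower(field.names) contains {literal}, field.value, null))\n"
            f"      within record as prop{i}"
        )

    return (
        "select\n"
        "  key,\n"
        "  description as desc,\n"
        + ",\n".join(exprs)
        + f"\nfrom {table}"
    )


query = build_pivot_query("mydataset.mytable", ["Trooper", "Awesome", "Nice"])
print(query)
```

The generated text still describes one pass over the table no matter how many properties are requested, which is the point of avoiding the self-joins.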

Jeremy Condit answered Mar 20 '23 17:03
