Understanding "Resources exceeded during query execution" with GROUP EACH BY in BigQuery

I'm writing a background job to automatically process A/B test data in BigQuery, and I'm hitting "Resources exceeded during query execution" on large GROUP EACH BY statements. I saw from Resources Exceeded during query execution that reducing the number of groups can make queries succeed, so I split my data into smaller pieces, but I'm still hitting errors (although less frequently). It would be nice to get a better intuition for what actually causes this error. In particular:

  • Does "resources exceeded" always mean that a shard ran out of memory, or could it also mean that the task ran out of time?
  • What's the right way to approximate the memory usage and the total memory I have available? Am I correct in assuming each shard tracks about 1/n of the groups and keeps the group key and all aggregates for each group, or is there another way that I should be thinking about it?
  • How is the number of shards determined? In particular, do I get fewer shards/resources if I'm querying over a smaller dataset?

The problematic query looks like this (in practice, it's used as a subquery, and the outer query aggregates the results):

SELECT
    alternative,
    snapshot_time,
    SUM(column_1),
    ...
    SUM(column_139)
FROM
    my_table
CROSS JOIN
    [table containing 24 unix timestamps] timestamps
WHERE last_updated_time < timestamps.snapshot_time
GROUP EACH BY alternative, user_id, snapshot_time

(Here's an example failed job: 124072386181:job_XF6MksqoItHNX94Z6FaKpuktGh4 )

I realize this query may be asking for trouble, but in this case the table is only 22 MB, the query produces under a million groups, and it still fails with "resources exceeded". Reducing the number of timestamps processed at once fixes the error, but I'm worried that I'll eventually reach a data scale where this approach stops working entirely.
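For concreteness, the batching workaround amounts to processing only a slice of the timestamps per query, roughly like this (the timestamps table name and the batch bounds are made up for illustration; the middle columns are elided as above):

SELECT
    alternative,
    snapshot_time,
    SUM(column_1),
    ...
    SUM(column_139)
FROM
    my_table
CROSS JOIN
    [my_dataset.timestamps] timestamps
WHERE last_updated_time < timestamps.snapshot_time
    -- process, say, 6 of the 24 timestamps per job (example unix-second bounds)
    AND timestamps.snapshot_time >= 1393632000
    AND timestamps.snapshot_time < 1394150400
GROUP EACH BY alternative, user_id, snapshot_time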

Alan Pierce · asked Mar 24 '14

1 Answer

As you've guessed, BigQuery chooses a number of parallel workers (shards) for GROUP EACH and JOIN EACH queries based on the size of the tables being operated upon. It is a rough heuristic, but in practice, it works pretty well.

What is interesting about your query is that the GROUP EACH is being done over a larger table than the original table because of the expansion in the CROSS JOIN. Because of this, we choose a number of shards that is too small for your query.

To answer your specific questions:

  • Resources exceeded almost always means that a worker ran out of memory. The worker could be a shard or, in Dremel terms, a mixer (mixers are the nodes in the computation tree that aggregate results; GROUP EACH BY pushes aggregation down to the shards, which are the leaves of the computation tree).

  • There isn't a good way to approximate the amount of resources available. This changes over time, with the goal that more of your queries should just work.

  • The number of shards is determined by the total bytes processed in the query. As you've noticed, this heuristic doesn't work well with joins that expand the underlying data sets. That said, there is active work underway to be smarter about how we pick the number of shards. To give you an idea of scale, your query got scheduled on only 20 shards, which is a tiny fraction of what a larger table would get.

As a workaround, you could save the intermediate result of the CROSS JOIN as a table and run the GROUP EACH BY over that temporary table. That should let BigQuery use the expanded size when picking the number of shards. (If that doesn't work, please let me know; it's possible that we need to tweak our assignment thresholds.)
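In legacy SQL, that two-step workaround would look roughly like this (the my_dataset.timestamps and my_dataset.expanded table names are hypothetical, and the middle columns are elided as in the question):

-- Step 1: materialize the CROSS JOIN by setting a destination table
-- (say my_dataset.expanded) in the job configuration.
SELECT
    alternative,
    user_id,
    timestamps.snapshot_time AS snapshot_time,
    column_1,
    ...
    column_139
FROM
    my_table
CROSS JOIN
    [my_dataset.timestamps] timestamps
WHERE last_updated_time < timestamps.snapshot_time

-- Step 2: aggregate over the materialized table; the shard count is now
-- chosen based on the expanded table's size.
SELECT
    alternative,
    snapshot_time,
    SUM(column_1),
    ...
    SUM(column_139)
FROM [my_dataset.expanded]
GROUP EACH BY alternative, user_id, snapshot_time

Since the intermediate result can be large, step 1 will likely also need allowLargeResults enabled on the job.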

Jordan Tigani · answered Sep 29 '22