Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What causes "resources exceeded" in BigQuery?

My query failed with the error "resources exceeded". What causes this error, and how can I fix it?

like image 935
Jeremy Condit Avatar asked Jun 12 '14 18:06

Jeremy Condit


1 Answers

Update (2016-03-16): For most queries, EACH is no longer required, and may actually increase the likelihood of seeing this error. If you omit the EACH keyword from every JOIN and GROUP BY in your query, the query engine will now dynamically optimize your query to eliminate this error.

There are still corner cases where specifying the EACH keyword can make a query run (or run faster), but generally speaking the BigQuery team recommends that you try your query without EACH first. Pretty soon, the EACH keyword will become a complete no-op.


Original answer: When you use the EACH keyword in JOIN EACH or GROUP EACH BY, or when you use a PARTITION BY clause, BigQuery partitions ("shuffles") your data on the fly according to the join keys or group keys, which allows each worker task to perform its portion of the join or aggregation locally.

The resources exceeded error occurs when one such worker gets too much data, and run over its limit. Generally speaking, the reasons for this error fall into two categories:

  1. Skew: The data is heavily skewed toward one key value (say, a "guest" user ID or a null key), which means that one worker gets all the records for that key and gets overloaded.

  2. Mismatch in data size and worker count: You have too much data for the number of workers that BigQuery assigned your query.

We are working on a number of improvements to help us cope with both scenarios so that you don't need to worry about these issues. For now, though, you can work around the problem with one of the following approaches:

  1. Filter out skewed keys. If your data is skewed because half of your join key values are actually null, you could filter those out by adding WHERE key IS NOT NULL prior to the join.

  2. Reduce the amount of data processed. Filter each side of the join with WHERE ABS(HASH(key)) % 5 == 0 to apply the join to only 1/5 of the data (or whatever fraction you want), and then do the same for == 1, == 2, == 3, == 4 in separate queries. You're manually sharding the data in smaller chunks to make the query go through--but note that you pay 5x as much because you queried the same data 5 times.

  3. Revisit your query. Maybe you can build your query in a completely different way, or compute some intermediate results, to get the answer you want.

like image 190
Jeremy Condit Avatar answered Sep 28 '22 10:09

Jeremy Condit