 

Is there a way to increase allotted memory for queries in BigQuery?

I have a large table (about 59 million rows, 7.1 GB), already ordered as I want, and I want to query this table and get a row_number() for each row. Unfortunately I get the error:

Resources exceeded during query execution: The query could not be executed in the allotted memory.

Is there a way to increase the allotted memory in BigQuery?

Here is my query. I don't see how I can simplify it, but if you have any advice, I'll take it:

SELECT
  row_number() over() as rowNumber,
  game,
  app_version,
  event_date,
  user_pseudo_id,
  event_name,
  event_timestamp,
  country,
  platform
FROM
`mediation_time_BASE`

Here is the complete error message:

Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 146% of limit. Top memory consumer(s): analytic OVER() clauses: 98% other/unattributed: 2%

Edit: the query here produces a list of event starts and ends, and I need to link each start event with its end, so I am following this tip: https://www.interfacett.com/blogs/how-to-use-values-from-previous-or-next-rows-in-a-query-in-sql-server/. For that I need the rows numbered with row_number(), so that I can split this subquery in two (event starts on one side, event ends on the other), join the two halves, and get one row per event holding both its start and its end, as follows (where `subquery` stands for the query with the row_number()):

SELECT
    (CASE WHEN LEAD(inter.rowNumber) OVER (ORDER BY inter.rowNumber) - inter.rowNumber = 1
          THEN LEAD(inter.rowNumber) OVER (ORDER BY inter.rowNumber)
          ELSE inter.rowNumber
     END) AS rowNumber,
    MIN(inter_success.rowNumber) AS rowNumber_success,
    inter.game,
    inter.app_version,
    inter.event_date,
    inter.user_pseudo_id,
    inter.event_timestamp AS event_start,
    MIN(inter_success.event_timestamp) AS event_end,
    inter_success.event_name AS results
FROM
    (SELECT * FROM `subquery` WHERE event_name = 'interstitial_fetch') AS inter
    INNER JOIN
    (SELECT * FROM `subquery` WHERE event_name = 'interstitial_fetch_success') AS inter_success
        ON inter.rowNumber < inter_success.rowNumber
        AND inter.game = inter_success.game
        AND inter.app_version = inter_success.app_version
        AND inter.user_pseudo_id = inter_success.user_pseudo_id
GROUP BY
    inter.rowNumber, inter.game, inter.app_version, inter.event_date,
    inter.user_pseudo_id, inter.event_timestamp, inter_success.event_name

This works fine with a smaller dataset, but not with 59 million rows...

Sophie Hamelin asked May 16 '19


1 Answer

TL;DR: You don't need to increase the memory for BigQuery.

To answer that, you need to understand how BigQuery works. BigQuery relies on executor machines called slots. All slots are of the same type and have a limited amount of memory.

Many operations, like GROUP BY, split the data across multiple slots; each slot performs a reduction on its portion of the data and sends the result upwards in the execution tree.

Some operations, however, must be performed on a single machine, like SORT and OVER(). When your data overflows the slot's memory, you get the error you described. What you would really need is a slot type with more memory, and that is unfortunately not something you can change. Instead, you have to follow the query best practices and avoid single-slot operations on too much data.

One thing that may help you is to compute the OVER() clause with a PARTITION BY, so that each partition is sent to a different machine, as in the sketch below. Another thing that usually helps is moving to standard SQL, if you haven't done that yet.
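
For illustration, here is a minimal sketch of that idea applied to the question's first query. It assumes that game, app_version and user_pseudo_id (the keys the follow-up self-join matches on) form an acceptable partitioning; that choice of columns is an assumption on my part, not something stated in the question:

SELECT
  -- Assumption: numbering rows per (game, app_version, user_pseudo_id)
  -- partition instead of globally, so no single slot has to hold the
  -- whole table to evaluate the OVER() clause.
  row_number() OVER (
    PARTITION BY game, app_version, user_pseudo_id
    ORDER BY event_timestamp
  ) AS rowNumber,
  game,
  app_version,
  event_date,
  user_pseudo_id,
  event_name,
  event_timestamp,
  country,
  platform
FROM
  `mediation_time_BASE`

Since the follow-up join already restricts matches to rows with the same game, app_version and user_pseudo_id, a rowNumber that is only unique within each partition should still pair the 'interstitial_fetch' rows with their 'interstitial_fetch_success' rows, while each OVER() computation stays small enough for a single slot.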

gidutz answered Sep 30 '22