
How can I make this query run efficiently?

In BigQuery, we're trying to run:

SELECT day, AVG(value)/(1024*1024) FROM ( 
    SELECT value, UTC_USEC_TO_DAY(timestamp) as day, 
         PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank 
    FROM [Datastore.PerformanceDatum]
    WHERE type = "MemoryPerf"
) WHERE rank >= 0.9 AND rank <= 0.91 
GROUP BY day 
ORDER BY day desc;

which returns a relatively small amount of data. But we're getting the message:

Error: Resources exceeded during query execution. The query contained a GROUP BY operator, consider using GROUP EACH BY instead. For more details, please see https://developers.google.com/bigquery/docs/query-reference#groupby

What is making this query fail? Is it the size of the subquery? Is there some equivalent query we can run that avoids the problem?


Edit in response to comments: if I add GROUP EACH BY (and drop the outer ORDER BY), the query still fails, claiming that GROUP EACH BY is not parallelizable here.
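
Concretely, that variant is the query above with GROUP EACH BY in place of GROUP BY and the outer ORDER BY dropped:

SELECT day, AVG(value)/(1024*1024) FROM (
    SELECT value, UTC_USEC_TO_DAY(timestamp) as day,
        PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
    FROM [Datastore.PerformanceDatum]
    WHERE type = "MemoryPerf"
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP EACH BY day;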

Asked Dec 06 '25 by Zachary Vance


1 Answer

I wrote an equivalent query that works for me:

SELECT day, AVG(value)/(1024*1024) FROM (
    SELECT data value, UTC_USEC_TO_DAY(timestamp) as day,
        PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
    FROM [io_sensor_data.moscone_io13]
    WHERE sensortype = "humidity"
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;

If I run only the inner query, I get 3,660,624 results. Is your dataset bigger than that?

The outer select gives me only 4 results when grouped by day. I'll try a different grouping to see if I can hit a limit there:

SELECT day, AVG(value)/(1024*1024) FROM (
    SELECT data value, timestamp / 1000 as day,
        PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
    FROM [io_sensor_data.moscone_io13]
    WHERE sensortype = "humidity"
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;

This runs too, now with 57,862 different groups.

I tried different combinations to reproduce the same error, and I managed to hit it by doubling the amount of initial data. An easy "hack" to double the amount of data is changing:

    FROM [io_sensor_data.moscone_io13]

To:

    FROM [io_sensor_data.moscone_io13], [io_sensor_data.moscone_io13]

Then I get the same error. How much data do you have? Can you apply an additional filter? Since you are already partitioning the PERCENTILE_RANK by day, can you add an additional filter to analyze only a fraction of the days (for example, only the last month)?
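
For example, here is a sketch against your original table (assuming your timestamp column is in microseconds, as the UTC_USEC_TO_DAY call suggests; the cutoff date is just a placeholder):

SELECT day, AVG(value)/(1024*1024) FROM (
    SELECT value, UTC_USEC_TO_DAY(timestamp) as day,
        PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
    FROM [Datastore.PerformanceDatum]
    WHERE type = "MemoryPerf"
        AND timestamp >= PARSE_UTC_USEC("2013-05-01 00:00:00")
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;

This keeps the PERCENTILE_RANK partitioning intact while shrinking the amount of data each worker has to sort.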

Answered Dec 09 '25 by Felipe Hoffa