How can I improve performance of average method in SQL?

I'm having some performance problems where a SQL query calculating the average of a column is progressively getting slower as the number of records grows. Is there an index type that I can add to the column that will allow for faster average calculations?

The DB in question is PostgreSQL and I'm aware that such an index type might not be available, but I'm also interested in the theoretical answer: whether this is even possible without some sort of caching solution.

To be more specific, the data in question is essentially a log with this sort of definition:

CREATE TABLE log (
  duration integer,
  time     date,
  event    text
);

I'm doing queries like

SELECT AVG(duration) FROM log WHERE event = 'finished'; -- gets average time to completion
SELECT AVG(duration) FROM log WHERE event = 'finished' AND time > $yesterday; -- average today

The second one is always fairly fast since it has a more restrictive WHERE clause, but the total average duration is the type of query that is causing the problem. I understand that I could cache the values, using OLAP or something; my question is whether there is a way to do this entirely with DB-side optimisations such as indices.

asked Dec 15 '10 by Sindri Traustason

3 Answers

It depends on what you are doing. If you aren't filtering the data, then beyond having the clustered index in order, how else is the database supposed to calculate an average of the column?

There are systems which perform online analytical processing (OLAP) and will do things like keep running sums and averages over the information you wish to examine. It all depends on what you are doing and your definition of "slow".

If you have a web-based application, for instance, perhaps you can generate the average once a minute and then cache it, serving the cached value to users over and over again.
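As a rough illustration of that idea in PostgreSQL terms (a sketch only, assuming a version with materialized views, 9.3 or later; the view name event_averages is made up):

CREATE MATERIALIZED VIEW event_averages AS
    SELECT event, AVG(duration) AS avg_duration
    FROM log
    GROUP BY event;

-- Refresh periodically (e.g. from cron or the application) instead of
-- recalculating the average on every request:
REFRESH MATERIALIZED VIEW event_averages;

-- Reads are then a cheap lookup:
SELECT avg_duration FROM event_averages WHERE event = 'finished';

A plain table maintained by a scheduled job or an application-level cache would serve the same purpose; the point is to compute once and serve many times.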

answered by Spence


Calculating an average will always get slower the more records you have, as it always has to read the value from every record in the result.

An index can still help if the index contains less data than the table itself. Creating an index on the field that you want the average of generally isn't helpful on its own, as you don't want to do a lookup; you just want to get at all the data as efficiently as possible. Typically you would add the field as an included (output) column in an index that is already used by the query.
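In PostgreSQL terms, one way to apply that to the query from the question is a partial index that contains only the data the query needs (a sketch, assuming the log table above; index-only scans require PostgreSQL 9.2+ and a reasonably fresh visibility map):

CREATE INDEX log_finished_duration_idx
    ON log (duration)
    WHERE event = 'finished';   -- partial index: only the rows the query reads

-- With the averaged column in the index and the filter matching the index
-- predicate, the planner can satisfy this with an index-only scan:
SELECT AVG(duration) FROM log WHERE event = 'finished';

This still has to read every matching index entry, so it only shrinks the constant factor; it does not make the average constant-time.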

answered by Guffa


Speeding up aggregates is usually done by keeping additional tables.

Assuming a sizeable table detail(id, dimA, dimB, dimC, value), if you would like the performance of AVG (or other aggregate functions) to be nearly constant regardless of the number of records, you could introduce a new table (a minimal sketch follows the list below)

dimAavg(dimA, avgValue)

  • The size of this table will depend only on the number of distinct values of dimA. Furthermore, this table could make sense in your design anyway, as it can hold the domain of values available for dimA in detail (and other attributes related to those domain values); you might/should already have such a table.
  • This table is only helpful if you analyze by dimA alone; as soon as you need AVG(value) grouped by dimA and dimB it becomes useless. So you need to know which attributes you will want to do fast analysis on. The number of rows required for keeping aggregates on multiple attributes is n(dimA) x n(dimB) x n(dimC) x ..., which may or may not grow quickly.
  • Maintaining this table increases the costs of updates (incl. inserts and deletes), but there are further optimizations that you can employ...
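For concreteness, a minimal sketch of such a table (the names follow the answer; the types are assumptions, not part of the original design):

CREATE TABLE dimAavg (
    dimA     integer PRIMARY KEY,   -- one row per distinct dimA value
    avgValue numeric NOT NULL
);

-- The average for a given dimA is then a single-row lookup instead of a
-- scan over detail (42 is just a placeholder value):
SELECT avgValue FROM dimAavg WHERE dimA = 42;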

For example, let us assume that the system predominantly does inserts and only occasionally updates and deletes.

Let's further assume that you want to analyze by dimA only and that ids are increasing. Then having a structure such as

dimA_agg(dimA, Total, Count, LastID) 

can help without a big impact on the system.

This is because you could have triggers that do not update the aggregates on every insert, but, let's say, only on every 100th insert.
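A rough sketch of what that could look like with a PL/pgSQL trigger (one possible implementation, ignoring concurrency; the function and trigger names are invented, and it assumes dimA_agg is pre-seeded with a row per dimA value):

CREATE OR REPLACE FUNCTION maintain_dima_agg() RETURNS trigger AS $$
BEGIN
    -- Fold new detail rows into the aggregates only once at least 100 of
    -- them have accumulated for this dimA value since the last fold.
    UPDATE dimA_agg a
       SET Total  = a.Total + s.sum_value,
           Count  = a.Count + s.cnt,
           LastID = s.max_id
      FROM (SELECT SUM(value) AS sum_value, COUNT(*) AS cnt, MAX(id) AS max_id
              FROM detail
             WHERE dimA = NEW.dimA
               AND id > (SELECT LastID FROM dimA_agg WHERE dimA = NEW.dimA)) s
     WHERE a.dimA = NEW.dimA
       AND s.cnt >= 100;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER detail_agg_maintenance
    AFTER INSERT ON detail
    FOR EACH ROW EXECUTE PROCEDURE maintain_dima_agg();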

This way you can still get accurate aggregates from this table together with the detail table:

SELECT a.dimA, (SUM(d.value) + MAX(a.Total)) / (COUNT(d.id) + MAX(a.Count)) AS avgDimA
FROM detail d
INNER JOIN dimA_agg a ON a.dimA = d.dimA AND d.id > a.LastID
GROUP BY a.dimA;

The above query, with proper indexes, would read one row from dimA_agg and fewer than 100 rows from detail - this would perform in near constant time (roughly log_fanout(n) for the index lookups) and would not require an update to dimA_agg for every insert (reducing update penalties).

The value of 100 was just given as an example; you should find the optimal value yourself (or even make it configurable, though triggers alone will not be enough in that case).

Maintenance for deletes and updates must fire on each operation, but you can still check whether the id of the record to be deleted or updated is already included in the stats, to avoid unnecessary updates (which will save some I/O).
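For the delete case, for example, a sketch along the same lines (again assuming the dimA_agg layout above; updates would need both the old and new values handled similarly):

CREATE OR REPLACE FUNCTION maintain_dima_agg_delete() RETURNS trigger AS $$
BEGIN
    -- Only adjust the aggregates if the deleted row was already folded in;
    -- rows newer than LastID were never counted, so there is nothing to undo.
    UPDATE dimA_agg
       SET Total = Total - OLD.value,
           Count = Count - 1
     WHERE dimA = OLD.dimA
       AND LastID >= OLD.id;
    RETURN OLD;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER detail_agg_delete
    AFTER DELETE ON detail
    FOR EACH ROW EXECUTE PROCEDURE maintain_dima_agg_delete();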

Note: this analysis is done for a domain with discrete attributes; when dealing with time series the situation gets more complicated - you have to decide the granularity of the domain in which you want to keep the summaries.

EDIT

There are also materialized views.

answered by Unreason