
Why are SQL aggregate functions so much slower than Python and Java (or Poor Man's OLAP)

I need a real DBA's opinion. Postgres 8.3 takes 200 ms to execute this query on my Macbook Pro while Java and Python perform the same calculation in under 20 ms (350,000 rows):

SELECT count(id), avg(a), avg(b), avg(c), avg(d) FROM tuples;

Is this normal behaviour when using a SQL database?

The schema (the table holds responses to a survey):

CREATE TABLE tuples (id integer primary key, a integer, b integer, c integer, d integer);

\copy tuples from '350,000 responses.csv' delimiter as ','

I wrote some tests in Java and Python for context, and they crush SQL (except for pure Python):

java   1.5 threads ~ 7 ms    
java   1.5         ~ 10 ms    
python 2.5 numpy   ~ 18 ms  
python 2.5         ~ 370 ms

Even sqlite3 is competitive with Postgres despite it assuming all columns are strings (for contrast: just switching to numeric columns instead of integers in Postgres results in a 10x slowdown).

Tunings I've tried without success (blindly following some web advice) include:

increased the shared memory available to Postgres to 256MB    
increased the working memory to 2MB
disabled connection and statement logging
used a stored procedure via CREATE FUNCTION ... LANGUAGE SQL

So my question is, is my experience here normal, and this is what I can expect when using a SQL database? I can understand that ACID must come with costs, but this is kind of crazy in my opinion. I'm not asking for realtime game speed, but since Java can process millions of doubles in under 20 ms, I feel a bit jealous.

Is there a better way to do simple OLAP on the cheap (both in terms of money and server complexity)? I've looked into Mondrian and Pig + Hadoop, but I'm not super excited about maintaining yet another server application, and I'm not sure they would even help.


No, the Python and Java code do all the work in-house, so to speak. I just generate 4 arrays with 350,000 random values each, then take the average. I don't include the generation in the timings, only the averaging step. The Java threads timing uses 4 threads (one per array average); overkill, but it's definitely the fastest.
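A minimal sketch of that in-memory test in Python (the sizes mirror the question, but the code itself is illustrative, not the original benchmark):

```python
import random
import time

N = 350_000  # same row count as the Postgres table

# Generate the four columns up front; generation is excluded from the timing.
columns = [[random.randint(0, 10_000) for _ in range(N)] for _ in range(4)]

start = time.perf_counter()
averages = [sum(col) / len(col) for col in columns]
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"averaged 4 columns of {N} ints in {elapsed_ms:.1f} ms")
```

Since the data is already in memory, the timed section is pure arithmetic, which is the point of the comparison: none of the database's parsing, planning, or I/O steps are paid here.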

The sqlite3 timing is driven by the Python program and is running from disk (not :memory:)

I realize Postgres is doing much more behind the scenes, but most of that work doesn't matter to me since this is read only data.

The Postgres query doesn't change timing on subsequent runs.

I've rerun the Python tests to include spooling it off the disk. The timing slows down considerably to nearly 4 secs. But I'm guessing that Python's file handling code is pretty much in C (though maybe not the csv lib?) so this indicates to me that Postgres isn't streaming from the disk either (or that you are correct and I should bow down before whoever wrote their storage layer!)

Jacob Rigby asked Sep 09 '08 10:09



2 Answers

I would say your test scheme is not really useful. To fulfill the db query, the db server goes through several steps:

  1. parse the SQL
  2. work up a query plan, i. e. decide on which indices to use (if any), optimize etc.
  3. if an index is used, search it for the pointers to the actual data, then go to the appropriate location in the data or
  4. if no index is used, scan the whole table to determine which rows are needed
  5. load the data from disk into a temporary location (hopefully, but not necessarily, memory)
  6. perform the count() and avg() calculations

So, creating an array in Python and getting the average basically skips all these steps save the last one. As disk I/O is among the most expensive operations a program has to perform, this is a major flaw in the test (see also the answers to this question I asked here before). Even if you read the data from disk in your other test, the process is completely different and it's hard to tell how relevant the results are.

To obtain more information about where Postgres spends its time, I would suggest the following tests:

  • Compare the execution time of your query to a SELECT without the aggregating functions (i. e. cut step 6)
  • If you find that the aggregation leads to a significant slowdown, see whether Python does it faster, obtaining the raw data through the plain SELECT from the comparison.
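That comparison can be run without touching Postgres at all, using sqlite3 from Python's standard library (the table and column names mirror the question; the data and timings are illustrative):

```python
import random
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tuples (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER, c INTEGER, d INTEGER)"
)
rows = [(i, *[random.randint(0, 100) for _ in range(4)]) for i in range(350_000)]
conn.executemany("INSERT INTO tuples VALUES (?,?,?,?,?)", rows)
conn.commit()

def timed(sql):
    start = time.perf_counter()
    result = conn.execute(sql).fetchall()
    return result, (time.perf_counter() - start) * 1000

# Plain SELECT: pays the scan and row-materialization cost, skips aggregation.
_, plain_ms = timed("SELECT a, b, c, d FROM tuples")
# Aggregate SELECT: same scan, plus count()/avg().
agg, agg_ms = timed("SELECT count(id), avg(a), avg(b), avg(c), avg(d) FROM tuples")

print(f"plain SELECT: {plain_ms:.0f} ms, aggregate SELECT: {agg_ms:.0f} ms")
```

If the aggregate query is not much slower than the plain one (it typically isn't), that supports the claim that the time goes into scanning and materializing rows, not into computing the averages.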

To speed up your query, reduce disk access first. I doubt very much that it's the aggregation that takes the time.

There are several ways to do that:

  • Cache data (in memory!) for subsequent access, either via the db engine's own capabilities or with tools like memcached
  • Reduce the size of your stored data
  • Optimize the use of indices. Sometimes this can mean skipping index use altogether (after all, index lookups are disk access, too). For MySQL, I seem to remember that it's recommended to skip indices if you assume that the query fetches more than 10% of all the data in the table.
  • If your query makes good use of indices, I know that for MySQL databases it helps to put indices and data on separate physical disks. However, I don't know whether that's applicable for Postgres.
  • There also might be more sophisticated problems such as swapping rows to disk if for some reason the result set can't be completely processed in memory. But I would leave that kind of research until I run into serious performance problems that I can't find another way to fix, as it requires knowledge about a lot of little under-the-hood details in your process.

Update:

I just realized that you seem to have no use for indices for the above query and most likely aren't using any either, so my advice on indices probably wasn't helpful. Sorry. Still, I'd say that the aggregation is not the problem; disk access is. I'll leave the index stuff in anyway; it might still have some use.

Hanno Fietz answered Sep 21 '22 22:09


Postgres is doing a lot more than it appears (maintaining data consistency, for a start!).

If the values don't have to be 100% spot on, or if the table is updated rarely, but you are running this calculation often, you might want to look into Materialized Views to speed it up.

(Note: I have not used materialized views in Postgres; they look a little hacky, but they might suit your situation.)

Materialized Views
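The same idea also works as a poor man's materialized view with no engine support at all: keep a small summary table and refresh it on a schedule. A sketch using sqlite3 from Python (the table names and the refresh-from-cron arrangement are assumptions for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tuples (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER)")
conn.executemany("INSERT INTO tuples VALUES (?,?,?)", [(i, i, 2 * i) for i in range(100)])

# The "materialized view": a one-row summary table.
conn.execute("CREATE TABLE mv_tuples (n INTEGER, avg_a REAL, avg_b REAL)")

def refresh_mv(conn):
    """Recompute the summary; run from a cron job or after each batch load."""
    conn.execute("DELETE FROM mv_tuples")
    conn.execute("INSERT INTO mv_tuples SELECT count(id), avg(a), avg(b) FROM tuples")
    conn.commit()

refresh_mv(conn)
# Readers hit the tiny summary table instead of scanning the full base table.
summary = conn.execute("SELECT * FROM mv_tuples").fetchone()
print(summary)
```

As with a real MV, reads are as stale as the last refresh, which is the trade-off the rest of this answer demonstrates in Oracle.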

Also consider the overhead of actually connecting to the server and the round trip required to send the request to the server and back.

I'd consider 200 ms for something like this to be pretty good. A quick test on my Oracle server, with the same table structure, about 500k rows, and no indexes, takes about 1-1.5 seconds, which is almost all just Oracle sucking the data off disk.

The real question is, is 200ms fast enough?

-------------- More --------------------

I was interested in solving this using materialized views, since I've never really played with them. This is in Oracle.

First I created a MV which refreshes every minute.

create materialized view mv_so_x 
build immediate 
refresh complete 
START WITH SYSDATE NEXT SYSDATE + 1/24/60
 as select count(*),avg(a),avg(b),avg(c),avg(d) from so_x;

While it's refreshing, no rows are returned:

SQL> select * from mv_so_x;

no rows selected

Elapsed: 00:00:00.00

Once it refreshes, it's MUCH faster than doing the raw query:

SQL> select count(*),avg(a),avg(b),avg(c),avg(d) from so_x;

  COUNT(*)     AVG(A)     AVG(B)     AVG(C)     AVG(D)
---------- ---------- ---------- ---------- ----------
   1899459 7495.38839 22.2905454 5.00276131 2.13432836

Elapsed: 00:00:05.74
SQL> select * from mv_so_x;

  COUNT(*)     AVG(A)     AVG(B)     AVG(C)     AVG(D)
---------- ---------- ---------- ---------- ----------
   1899459 7495.38839 22.2905454 5.00276131 2.13432836

Elapsed: 00:00:00.00
SQL> 

If we insert into the base table, the result is not immediately visible via the MV.

SQL> insert into so_x values (1,2,3,4,5);

1 row created.

Elapsed: 00:00:00.00
SQL> commit;

Commit complete.

Elapsed: 00:00:00.00
SQL> select * from mv_so_x;

  COUNT(*)     AVG(A)     AVG(B)     AVG(C)     AVG(D)
---------- ---------- ---------- ---------- ----------
   1899459 7495.38839 22.2905454 5.00276131 2.13432836

Elapsed: 00:00:00.00
SQL> 

But wait a minute or so, and the MV will update behind the scenes, and the result is returned as fast as you could want.

SQL> /

  COUNT(*)     AVG(A)     AVG(B)     AVG(C)     AVG(D)
---------- ---------- ---------- ---------- ----------
   1899460 7495.35823 22.2905352 5.00276078 2.17647059

Elapsed: 00:00:00.00
SQL> 

This isn't ideal. For a start, it's not realtime; inserts/updates will not be immediately visible. Also, you've got a query running to update the MV whether you need it or not (this can be tuned to whatever time frame, or run on demand). But this does show how much faster an MV can make things seem to the end user, if you can live with values that aren't quite up-to-the-second accurate.

Matthew Watson answered Sep 20 '22 22:09