I've got a table with around 20 million rows. For argument's sake, let's say there are two columns in the table: an id and a timestamp. I'm trying to get a count of the number of items per day. Here's what I have at the moment.
SELECT DATE(timestamp) AS day, COUNT(*)
FROM actions
WHERE DATE(timestamp) >= '20100101'
AND DATE(timestamp) < '20110101'
GROUP BY day;
Without any indices, this takes about 30 seconds to run on my machine. Here's the EXPLAIN ANALYZE output:
GroupAggregate  (cost=675462.78..676813.42 rows=46532 width=8) (actual time=24467.404..32417.643 rows=346 loops=1)
  ->  Sort  (cost=675462.78..675680.34 rows=87021 width=8) (actual time=24466.730..29071.438 rows=17321121 loops=1)
        Sort Key: (date("timestamp"))
        Sort Method:  external merge  Disk: 372496kB
        ->  Seq Scan on actions  (cost=0.00..667133.11 rows=87021 width=8) (actual time=1.981..12368.186 rows=17321121 loops=1)
              Filter: ((date("timestamp") >= '2010-01-01'::date) AND (date("timestamp") < '2011-01-01'::date))
Total runtime: 32447.762 ms
Since I'm seeing a sequential scan, I tried indexing on the date expression:
CREATE INDEX ON actions (DATE(timestamp));
That cuts the runtime by about 50%:
HashAggregate  (cost=796710.64..796716.19 rows=370 width=8) (actual time=17038.503..17038.590 rows=346 loops=1)
  ->  Seq Scan on actions  (cost=0.00..710202.27 rows=17301674 width=8) (actual time=1.745..12080.877 rows=17321121 loops=1)
        Filter: ((date("timestamp") >= '2010-01-01'::date) AND (date("timestamp") < '2011-01-01'::date))
Total runtime: 17038.663 ms
I'm new to this whole query-optimization business, and I have no idea what to do next. Any clues how I could get this query running faster?
--edit--
It looks like I'm hitting the limits of indices. This is pretty much the only query that gets run on this table (though the values of the dates change). Is there a way to partition up the table? Or create a cache table with all the count values? Or any other options?
Is there a way to partition up the table?
Yes:
http://www.postgresql.org/docs/current/static/ddl-partitioning.html
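As a minimal sketch for this table, assuming declarative partitioning (PostgreSQL 10 and later; older versions use the inheritance-and-triggers approach described in the linked docs), with made-up monthly partition names:

CREATE TABLE actions (
    id          bigint,
    "timestamp" timestamp
) PARTITION BY RANGE ("timestamp");

-- One partition per month; the planner can skip partitions that
-- fall entirely outside the queried range.
CREATE TABLE actions_2010_01 PARTITION OF actions
    FOR VALUES FROM ('2010-01-01') TO ('2010-02-01');
CREATE TABLE actions_2010_02 PARTITION OF actions
    FOR VALUES FROM ('2010-02-01') TO ('2010-03-01');
-- ...and so on for each month.

Note that partition pruning only kicks in when the query filters on "timestamp" directly (e.g. WHERE "timestamp" >= '2010-01-01'), rather than on DATE("timestamp").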
Or create a cache table with all the count values? Or any other options?
Creating a "cache" table is certainly possible. But it depends on how often you need that result and how accurate it needs to be.
CREATE TABLE action_report AS
SELECT DATE(timestamp) AS day, COUNT(*)
FROM actions
WHERE DATE(timestamp) >= '20100101'
AND DATE(timestamp) < '20110101'
GROUP BY day;
Then a simple SELECT * FROM action_report will give you what you want in a timely manner. You would then schedule a cron job to recreate that table on a regular basis.
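If you are on a PostgreSQL version with materialized views (9.3 or later), the same idea can be written as a materialized view, which the scheduled job merely refreshes. A minimal sketch, reusing the name and query from above:

CREATE MATERIALIZED VIEW action_report AS
SELECT DATE(timestamp) AS day, COUNT(*)
FROM actions
WHERE DATE(timestamp) >= '20100101'
AND DATE(timestamp) < '20110101'
GROUP BY day;

-- The cron job then only needs to run:
REFRESH MATERIALIZED VIEW action_report;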
This approach of course won't help if the time range changes with every query or if that query is only run once a day.
Set work_mem to, say, 2GB and see if that changes the plan. If it doesn't, you might be out of options.
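A quick way to try that for a single session, without editing postgresql.conf (the 2GB figure is just the example above; size it to your machine's RAM):

SET work_mem = '2GB';

EXPLAIN ANALYZE
SELECT DATE(timestamp) AS day, COUNT(*)
FROM actions
WHERE DATE(timestamp) >= '20100101'
AND DATE(timestamp) < '20110101'
GROUP BY day;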