I've got a table with around 20 million rows. For argument's sake, let's say there are two columns in the table: an id and a timestamp. I'm trying to get a count of the number of items per day. Here's what I have at the moment.
SELECT DATE(timestamp) AS day, COUNT(*)
FROM actions
WHERE DATE(timestamp) >= '20100101'
AND DATE(timestamp) < '20110101'
GROUP BY day;
Without any indices, this takes about 30 seconds to run on my machine. Here's the EXPLAIN ANALYZE output:
GroupAggregate  (cost=675462.78..676813.42 rows=46532 width=8) (actual time=24467.404..32417.643 rows=346 loops=1)
  ->  Sort  (cost=675462.78..675680.34 rows=87021 width=8) (actual time=24466.730..29071.438 rows=17321121 loops=1)
        Sort Key: (date("timestamp"))
        Sort Method:  external merge  Disk: 372496kB
        ->  Seq Scan on actions  (cost=0.00..667133.11 rows=87021 width=8) (actual time=1.981..12368.186 rows=17321121 loops=1)
              Filter: ((date("timestamp") >= '2010-01-01'::date) AND (date("timestamp") < '2011-01-01'::date))
Total runtime: 32447.762 ms
Since I'm seeing a sequential scan, I tried indexing on the date expression:
CREATE INDEX ON actions (DATE(timestamp));
That cuts the runtime by about 50%:
HashAggregate  (cost=796710.64..796716.19 rows=370 width=8) (actual time=17038.503..17038.590 rows=346 loops=1)
  ->  Seq Scan on actions  (cost=0.00..710202.27 rows=17301674 width=8) (actual time=1.745..12080.877 rows=17321121 loops=1)
        Filter: ((date("timestamp") >= '2010-01-01'::date) AND (date("timestamp") < '2011-01-01'::date))
Total runtime: 17038.663 ms
I'm new to this whole query-optimization business, and I have no idea what to do next. Any clues how I could get this query running faster?
--edit--
It looks like I'm hitting the limits of indices. This is pretty much the only query that gets run on this table (though the values of the dates change). Is there a way to partition up the table? Or create a cache table with all the count values? Or any other options?
Is there a way to partition up the table?
Yes:
http://www.postgresql.org/docs/current/static/ddl-partitioning.html
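As a minimal sketch for this table, assuming declarative partitioning (PostgreSQL 10 and later; older versions use the inheritance-and-triggers approach described in the linked docs), with made-up monthly partition names:

CREATE TABLE actions (
    id          bigint,
    "timestamp" timestamp
) PARTITION BY RANGE ("timestamp");

-- One partition per month; the planner can skip partitions that
-- fall entirely outside the queried range.
CREATE TABLE actions_2010_01 PARTITION OF actions
    FOR VALUES FROM ('2010-01-01') TO ('2010-02-01');
CREATE TABLE actions_2010_02 PARTITION OF actions
    FOR VALUES FROM ('2010-02-01') TO ('2010-03-01');
-- ...and so on for each month.

Note that partition pruning only kicks in when the query filters on "timestamp" directly (e.g. WHERE "timestamp" >= '2010-01-01'), rather than on DATE("timestamp").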
Or create a cache table with all the count values? Or any other options?
Creating a "cache" table is certainly possible. But it depends on how often you need that result and how accurate it needs to be.
CREATE TABLE action_report AS
SELECT DATE(timestamp) AS day, COUNT(*)
FROM actions
WHERE DATE(timestamp) >= '20100101'
AND DATE(timestamp) < '20110101'
GROUP BY day;
Then a simple SELECT * FROM action_report will give you what you want in a timely manner. You would then schedule a cron job to recreate that table on a regular basis.
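If you are on a PostgreSQL version with materialized views (9.3 or later), the same idea can be written as a materialized view, which the scheduled job merely refreshes. A minimal sketch, reusing the name and query from above:

CREATE MATERIALIZED VIEW action_report AS
SELECT DATE(timestamp) AS day, COUNT(*)
FROM actions
WHERE DATE(timestamp) >= '20100101'
AND DATE(timestamp) < '20110101'
GROUP BY day;

-- The cron job then only needs to run:
REFRESH MATERIALIZED VIEW action_report;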
This approach of course won't help if the time range changes with every query or if that query is only run once a day.
Set work_mem to, say, 2GB and see if that changes the plan. If it doesn't, you might be out of options.
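A quick way to try that for a single session, without editing postgresql.conf (the 2GB figure is just the example above; size it to your machine's RAM):

SET work_mem = '2GB';

EXPLAIN ANALYZE
SELECT DATE(timestamp) AS day, COUNT(*)
FROM actions
WHERE DATE(timestamp) >= '20100101'
AND DATE(timestamp) < '20110101'
GROUP BY day;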