Speeding up a GROUP BY date query on a big table in Postgres

I've got a table with around 20 million rows. For argument's sake, let's say there are two columns in the table - an id and a timestamp. I'm trying to get a count of the number of items per day. Here's what I have at the moment:

  SELECT DATE(timestamp) AS day, COUNT(*)
    FROM actions
   WHERE DATE(timestamp) >= '20100101'
     AND DATE(timestamp) <  '20110101'
GROUP BY day;

Without any indices, this takes about 30 seconds to run on my machine. Here's the EXPLAIN ANALYZE output:

 GroupAggregate  (cost=675462.78..676813.42 rows=46532 width=8) (actual time=24467.404..32417.643 rows=346 loops=1)
   ->  Sort  (cost=675462.78..675680.34 rows=87021 width=8) (actual time=24466.730..29071.438 rows=17321121 loops=1)
         Sort Key: (date("timestamp"))
         Sort Method:  external merge  Disk: 372496kB
         ->  Seq Scan on actions  (cost=0.00..667133.11 rows=87021 width=8) (actual time=1.981..12368.186 rows=17321121 loops=1)
               Filter: ((date("timestamp") >= '2010-01-01'::date) AND (date("timestamp") < '2011-01-01'::date))
 Total runtime: 32447.762 ms

Since I'm seeing a sequential scan, I tried to index on the date expression:

CREATE INDEX ON actions (DATE(timestamp));

That cuts the runtime by about 50%:

 HashAggregate  (cost=796710.64..796716.19 rows=370 width=8) (actual time=17038.503..17038.590 rows=346 loops=1)
   ->  Seq Scan on actions  (cost=0.00..710202.27 rows=17301674 width=8) (actual time=1.745..12080.877 rows=17321121 loops=1)
         Filter: ((date("timestamp") >= '2010-01-01'::date) AND (date("timestamp") < '2011-01-01'::date))
 Total runtime: 17038.663 ms

I'm new to this whole query-optimization business, and I have no idea what to do next. Any clues how I could get this query running faster?
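
One commonly suggested variant is to filter on the raw column rather than on DATE(timestamp), so a plain btree index on timestamp is usable. A sketch (the index name is illustrative; note that with roughly 17 of the 20 million rows matching the range, the planner may still prefer a sequential scan):

  CREATE INDEX actions_timestamp_idx ON actions (timestamp);

  SELECT DATE(timestamp) AS day, COUNT(*)
    FROM actions
   WHERE timestamp >= '2010-01-01'
     AND timestamp <  '2011-01-01'
  GROUP BY day;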

--edit--

It looks like I'm hitting the limits of indices. This is pretty much the only query that gets run on this table (though the values of the dates change). Is there a way to partition up the table? Or create a cache table with all the count values? Or any other options?

asked Jan 13 '11 by zaius

2 Answers

Is there a way to partition up the table?

Yes:
http://www.postgresql.org/docs/current/static/ddl-partitioning.html
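
As a sketch of what that can look like: this uses declarative partitioning (PostgreSQL 10+; at the time of the question the linked docs described the inheritance-based equivalent), with table and column definitions assumed from the question:

CREATE TABLE actions (
    id        bigint,
    timestamp timestamp NOT NULL
) PARTITION BY RANGE (timestamp);

-- One partition per year; a query constrained to 2010 only scans actions_2010.
CREATE TABLE actions_2010 PARTITION OF actions
    FOR VALUES FROM ('2010-01-01') TO ('2011-01-01');

CREATE TABLE actions_2011 PARTITION OF actions
    FOR VALUES FROM ('2011-01-01') TO ('2012-01-01');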

Or create a cache table with all the count values? Or any other options?

Creating a "cache" table is certainly possible, but it depends on how often you need that result and how accurate it needs to be.

CREATE TABLE action_report
AS
SELECT DATE(timestamp) AS day, COUNT(*)
    FROM actions
   WHERE DATE(timestamp) >= '20100101'
     AND DATE(timestamp) <  '20110101'
GROUP BY day;

Then a SELECT * FROM action_report will give you what you want in a timely manner. You would then schedule a cron job to recreate that table on a regular basis.

This approach of course won't help if the time range changes with every query or if that query is only run once a day.
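
On newer versions (PostgreSQL 9.3+), a materialized view is a sketch of the same idea that avoids dropping and recreating the table by hand; the cron job then only needs to run the refresh:

CREATE MATERIALIZED VIEW action_report AS
SELECT DATE(timestamp) AS day, COUNT(*) AS cnt
    FROM actions
   WHERE DATE(timestamp) >= '2010-01-01'
     AND DATE(timestamp) <  '2011-01-01'
GROUP BY day;

-- Run periodically (e.g. from cron via psql) to recompute the cached counts:
REFRESH MATERIALIZED VIEW action_report;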

answered Oct 14 '22 by a_horse_with_no_name


Set work_mem to, say, 2GB and see if that changes the plan. If it doesn't, you might be out of options.
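
For example, at the session level (the point is to see whether the 372 MB on-disk sort from the first plan fits in memory instead):

-- Raise the per-operation memory limit for this session only, then re-check the plan.
SET work_mem = '2GB';

EXPLAIN ANALYZE
SELECT DATE(timestamp) AS day, COUNT(*)
    FROM actions
   WHERE DATE(timestamp) >= '20100101'
     AND DATE(timestamp) <  '20110101'
GROUP BY day;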

answered Oct 13 '22 by Peter Eisentraut