Get the number of times something happened between two times when the order of magnitude of the count is 100,000 - 10,000,000.
The columns:
The query to get the count (pseudocode):
COUNT rows WHERE time_occurred > <begin_time> AND time_occurred < <end_time>
This works, but the query is very inefficient and takes about 40 seconds to respond. As I understand it, PostgreSQL is not a good database to use for this type of query.
I sat down and thought up a few ways that this type of query could be indexed and executed in O(log n) time so I know t is possible.
What tools should I be using to do this? Should we be using a different database to store the count rows? Is there a package we could install on top of PostgreSQL to do this easily? What are our options?
Not sure if I was clear about this. The result of COUNT
should be on the order of 100,000 - 10,000,000. This means that the number of rows that match the query would be on the order of 100,000 - 10,000,000. The actual number of rows in the table is an order of magnitude more.
Thanks so much!
Pre-PostgreSQL 9.2 the implementation of MVCC required any query to visit each row of the table to check if that version of the row was visible to the current transaction. This would happen, even if the query only involved indexed columns. This manifests as slow counts on large tables, even for simple cases.
PostgreSQL 9.2 implements index only scans, which may help alleviate this issue for some workloads.
If you are stuck below v9.2, there are some known workarounds if you only need an approximate row count on a simple query. See http://wiki.postgresql.org/wiki/Count_estimate .
Keep a table of incidents aggregated by day.
create table incidents_agreggated_by_day (
"day" date primary key, total integer
);
Everyday run:
insert into events_agreggated_by_day ("day", total) values
select date_trunc('day', time_occurred), count(*) total
from incidents
where
time_occurred < current_date
and date_trunc('day', time_occurred) not in (
select "day" from incidents_agreggated_by_day
)
group by 1
Suppose you want the total between '2013-01-01 10:37' and '2013-03-02 11:20':
select
(
select sum(total)
from incidents_aggregated_by_day
where "day" >= '2013-01-02'::date and "day" < '2013-03-02'::date
) +
(
select count(*)
from incidents
where
time_ocurred >= '2013-01-01 10:37':timestamp
and time_ocurred < '2013-01-02'
or
time_ocurred <= '2013-03-02 11:20':timestamp
and time_ocurred >= '2013-01-02'
) total
In instead of reading 100 million rows you will read hundreds or thousands. If properly indexed it will be fast.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With