
Count the number of matching rows for high magnitude counts (100,000+)

Objective:

Get the number of times something happened between two points in time, where the resulting count is on the order of 100,000 - 10,000,000.

Current implementation:

  • Using PostgreSQL
  • Each "incident" is recorded as a separate row in a table

The columns:

  • Incident type
  • Date-Time that it occurred

The query to get the count (pseudocode):

COUNT rows WHERE time_occurred > <begin_time> AND time_occurred < <end_time>
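In concrete terms (the table and column names here are just illustrative), the schema and query look something like this:

create table incidents (
    incident_type text not null,
    time_occurred timestamp not null
);

-- the count query, with <begin_time>/<end_time> as bind parameters
prepare count_incidents(timestamp, timestamp) as
select count(*)
from incidents
where time_occurred > $1
  and time_occurred < $2;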

The problem:

This works, but the query is very inefficient and takes about 40 seconds to respond. As I understand it, PostgreSQL is not a good database to use for this type of query.

I sat down and thought up a few ways that this type of query could be indexed and executed in O(log n) time, so I know it is possible.

What tools should I be using to do this? Should we be using a different database to store the count rows? Is there a package we could install on top of PostgreSQL to do this easily? What are our options?

Note:

Not sure if I was clear about this: the result of COUNT should be on the order of 100,000 - 10,000,000, meaning that 100,000 - 10,000,000 rows match the query. The total number of rows in the table is an order of magnitude larger.

Thanks so much!

asked Mar 10 '13 by Chris Dutrow


2 Answers

Pre-PostgreSQL 9.2, the implementation of MVCC required any query to visit each row of the table to check whether that version of the row was visible to the current transaction. This happened even if the query only involved indexed columns. It manifests as slow counts on large tables, even for simple cases.

PostgreSQL 9.2 implements index only scans, which may help alleviate this issue for some workloads.
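As a sketch (assuming a table named incidents with a time_occurred column, as described in the question), an index on the timestamp column gives the planner the option of an index-only scan for this kind of range count on 9.2+:

create index incidents_time_occurred_idx on incidents (time_occurred);

-- on 9.2+, with a recently vacuumed table, this can be answered from
-- the index alone instead of visiting every heap row
select count(*)
from incidents
where time_occurred > '2013-01-01 10:37'
  and time_occurred < '2013-03-02 11:20';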

If you are stuck below v9.2, there are some known workarounds if you only need an approximate row count on a simple query. See http://wiki.postgresql.org/wiki/Count_estimate.
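One such workaround (a sketch; note it only gives a rough whole-table estimate, not a count for an arbitrary WHERE clause) reads the planner's statistics instead of scanning the table:

-- approximate total row count from the planner statistics;
-- accuracy depends on how recently the table was analyzed
select reltuples::bigint as approximate_rows
from pg_class
where relname = 'incidents';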

answered Oct 14 '22 by nathan-m



Keep a table of incidents aggregated by day.

create table incidents_aggregated_by_day (
    "day" date primary key,
    total integer
);

Every day, run:

insert into incidents_aggregated_by_day ("day", total)
select date_trunc('day', time_occurred)::date, count(*) as total
from incidents
where
    time_occurred < current_date
    and date_trunc('day', time_occurred)::date not in (
        select "day" from incidents_aggregated_by_day
    )
group by 1;

Suppose you want the total between '2013-01-01 10:37' and '2013-03-02 11:20':

select
(
    -- whole days fully inside the range: use the daily aggregates
    select coalesce(sum(total), 0)
    from incidents_aggregated_by_day
    where "day" >= '2013-01-02'::date and "day" < '2013-03-02'::date
) +
(
    -- partial days at each end of the range: count the raw rows
    select count(*)
    from incidents
    where
        (time_occurred >= '2013-01-01 10:37'::timestamp
         and time_occurred < '2013-01-02')
        or
        (time_occurred >= '2013-03-02'
         and time_occurred <= '2013-03-02 11:20'::timestamp)
) as total;

Instead of reading 100 million rows you will read hundreds or thousands. If properly indexed it will be fast.
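For the partial-day lookups at either end of the range, a plain index on the timestamp column is enough (a sketch; the aggregate table's "day" column is already indexed by its primary key):

create index incidents_time_occurred_idx on incidents (time_occurred);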

answered Oct 14 '22 by Clodoaldo Neto