
Count the number of matching rows for high magnitude counts (100,000+)

Objective:

Get the number of times something happened between two points in time, where the resulting count is on the order of 100,000 - 10,000,000.

Current implementation:

  • Using PostgreSQL
  • Each "incident" is recorded as a separate row in a table

The columns:

  • Incident type
  • Date-Time that it occurred

The query to get the count (pseudocode):

COUNT rows WHERE time_occurred > <begin_time> AND time_occurred < <end_time>
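In concrete terms (the table and column names here are just illustrative), the schema and query look something like this:

create table incidents (
    incident_type text not null,
    time_occurred timestamp not null
);

-- the count query, with <begin_time>/<end_time> as bind parameters
prepare count_incidents(timestamp, timestamp) as
select count(*)
from incidents
where time_occurred > $1
  and time_occurred < $2;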

The problem:

This works, but the query is very inefficient and takes about 40 seconds to respond. As I understand it, PostgreSQL is not a good database to use for this type of query.

I sat down and thought up a few ways that this type of query could be indexed and executed in O(log n) time, so I know it is possible.

What tools should I be using to do this? Should we be using a different database to store the count rows? Is there a package we could install on top of PostgreSQL to do this easily? What are our options?

Note:

Not sure if I was clear about this: the result of COUNT should be on the order of 100,000 - 10,000,000, meaning that 100,000 - 10,000,000 rows match the query. The total number of rows in the table is an order of magnitude larger.

Thanks so much!

asked Mar 10 '13 by Chris Dutrow


2 Answers

Pre-PostgreSQL 9.2, the implementation of MVCC required any query to visit each row of the table to check whether that version of the row was visible to the current transaction. This happened even if the query only involved indexed columns. It manifests as slow counts on large tables, even for simple cases.

PostgreSQL 9.2 implements index only scans, which may help alleviate this issue for some workloads.
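As a sketch (assuming a table named incidents with a time_occurred column, as described in the question), an index on the timestamp column gives the planner the option of an index-only scan for this kind of range count on 9.2+:

create index incidents_time_occurred_idx on incidents (time_occurred);

-- on 9.2+, with a recently vacuumed table, this can be answered from
-- the index alone instead of visiting every heap row
select count(*)
from incidents
where time_occurred > '2013-01-01 10:37'
  and time_occurred < '2013-03-02 11:20';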

If you are stuck below v9.2, there are some known workarounds if you only need an approximate row count on a simple query. See http://wiki.postgresql.org/wiki/Count_estimate.
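One such workaround (a sketch; note it only gives a rough whole-table estimate, not a count for an arbitrary WHERE clause) reads the planner's statistics instead of scanning the table:

-- approximate total row count from the planner statistics;
-- accuracy depends on how recently the table was analyzed
select reltuples::bigint as approximate_rows
from pg_class
where relname = 'incidents';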

answered Oct 14 '22 by nathan-m



Keep a table of incidents aggregated by day.

create table incidents_aggregated_by_day (
    "day" date primary key,
    total integer
);

Every day, run:

insert into incidents_aggregated_by_day ("day", total)
select date_trunc('day', time_occurred)::date, count(*) as total
from incidents
where
    time_occurred < current_date
    and date_trunc('day', time_occurred)::date not in (
        select "day" from incidents_aggregated_by_day
    )
group by 1;

Suppose you want the total between '2013-01-01 10:37' and '2013-03-02 11:20':

select
(
    -- whole days fully inside the range: use the daily aggregates
    select coalesce(sum(total), 0)
    from incidents_aggregated_by_day
    where "day" >= '2013-01-02'::date and "day" < '2013-03-02'::date
) +
(
    -- partial days at each end of the range: count the raw rows
    select count(*)
    from incidents
    where
        (time_occurred >= '2013-01-01 10:37'::timestamp
         and time_occurred < '2013-01-02')
        or
        (time_occurred >= '2013-03-02'
         and time_occurred <= '2013-03-02 11:20'::timestamp)
) as total;

Instead of reading 100 million rows you will read hundreds or thousands. If properly indexed it will be fast.
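For the partial-day lookups at either end of the range, a plain index on the timestamp column is enough (a sketch; the aggregate table's "day" column is already indexed by its primary key):

create index incidents_time_occurred_idx on incidents (time_occurred);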

answered Oct 14 '22 by Clodoaldo Neto