Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Selecting all rows created less than 1 second apart

I have a table in PostgreSQL 9.3:

create table purchase(
   transaction_id integer,
   account_id bigint,
   created timestamp with time zone,
   price numeric(5,2)
)

I think I have a problem where a system is sending me duplicate records, but I don't know how widespread the issue is.

I need a query to select all records created within 1 second (not necessary the same second) that has the same account_id and same price. So, for example, I would want to be able to find these two records:

+----------------+----------------+-------------------------------+-------+
| transaction_id |   account_id   |            created            | price |
+----------------+----------------+-------------------------------+-------+
|          85239 | 80012340116730 | 2014-05-07 15:46:03.361959+00 |  8.47 |
|          85240 | 80012340116730 | 2014-05-07 15:46:04.118911+00 |  8.47 |
+----------------+----------------+-------------------------------+-------+

How can I do this in a single query?

like image 283
David S Avatar asked May 08 '14 19:05

David S


2 Answers

You need to check for existence of a row within a second in both directions.
And you need to exclude the row itself from the test:

SELECT *
FROM   purchase p
WHERE  EXISTS (
   SELECT FROM  purchase p1
   WHERE  p1.created > p.created - interval '1 sec' -- "less than a second"
   AND    p1.created < p.created + interval '1 sec'
   AND    p1.account_id = p.account_id
   AND    p1.price      = p.price
   AND    p1.transaction_id <> p.transaction_id   -- assuming that's the pk
   )
ORDER BY account_id, price, created;         -- optional, for handy output

These WHERE conditions are sargable, which allows the use of an index on created:

WHERE p1.created > p.created - interval '1 sec'
AND   p1.created < p.created + interval '1 sec'

As opposed to:

p1.created - p.created < interval '1 sec'

The latter cannot use an index on created which may slow down the query with big tables. Postgres is forced to test all remaining combinations (after applying other conditions). Depending on the selectivity of the other conditions and the size of the table, this may be irrelevant or a moderate to huge performance drain.
With small to medium tables, tests have shown two sequence scans and a hash semi join for either query.

Index

The perfect index for the case would be a multicolumn index of the form:

CREATE INDEX purchase_foo_idx ON purchase (account_id, price, created)

Combinations of indexes on individual columns work well, too (and might have more use cases).

like image 72
Erwin Brandstetter Avatar answered Sep 28 '22 04:09

Erwin Brandstetter


I think you're looking for something like this:

select *
from purchase p1
where exists (
    select transaction_id 
    from purchase p2 
    where p2.created > p1.created
    and p2.created - p1.created < interval '1 second'
    and p2.account_id = p1.account_id
    and p2.price = p1.price)

EDIT: The query could be very heavy on big table. Consider limit it e.g. to one day:

select *
from purchase p1
where 
    p1.created::date = '2014-05-08'
    and exists (
        select transaction_id 
        from purchase p2 
        where p2.created::date = '2014-05-08'
        and p2.created > p1.created
        and p2.created - p1.created < interval '1 second'
        and p2.account_id = p1.account_id
        and p2.price = p1.price)

SQL Fiddle

like image 31
klin Avatar answered Sep 28 '22 04:09

klin