PostgreSQL delete all but the oldest records

Question

I have a PostgreSQL database that has multiple entries for the objectid, on multiple devicenames, but there is a unique timestamp for each entry. The table looks something like this:

address | devicename | objectid      |  timestamp       
--------+------------+---------------+------------------------------
1.1.1.1 | device1    | vs_hub.ch1_25 | 2012-10-02 17:36:41.011629+00
1.1.1.2 | device2    | vs_hub.ch1_25 | 2012-10-02 17:48:01.755559+00
1.1.1.1 | device1    | vs_hub.ch1_25 | 2012-10-03 15:37:09.06065+00
1.1.1.2 | device2    | vs_hub.ch1_25 | 2012-10-03 15:48:33.93128+00
1.1.1.1 | device1    | vs_hub.ch1_25 | 2012-10-05 16:01:59.266779+00
1.1.1.2 | device2    | vs_hub.ch1_25 | 2012-10-05 16:13:46.843113+00
1.1.1.1 | device1    | vs_hub.ch1_25 | 2012-10-06 01:11:45.853361+00
1.1.1.2 | device2    | vs_hub.ch1_25 | 2012-10-06 01:23:21.204324+00

I want to delete all but the oldest entry for each odjectid and devicename. In this case I want to delete all but:

1.1.1.1 | device1 | vs_hub.ch1_25 | 2012-10-02 17:36:41.011629+00
1.1.1.2 | device2 | vs_hub.ch1_25 | 2012-10-02 17:48:01.755559+00

Is there a way do this? Or is it possible to select the oldest entries for both "objectid and devicename" into a temp table?

Erwin Brandstetter · Accepted Answer

To distill the described result, this would probably simplest and fastest:

SELECT DISTINCT ON (devicename, objectid) *
FROM   tbl
ORDER  BY devicename, objectid, ts DESC;

Details and explanation in this related answer.

From your sample data, I conclude that you are going to delete large portions of the original table. It is probably faster to just TRUNCATE the table (or DROP & recreate, since you should add a surrogate pk column anyway) and write the remaining rows to it. This also provides you with a pristine table, implicitly clustered (ordered) the way it's best for your queries and save the work that VACUUM would have to do otherwise. And it's probably still faster overall:

I would also strongly advise to add a surrogate primary key to your table, preferably a serial column.

BEGIN;

CREATE TEMP TABLE tmp_tbl ON COMMIT DROP AS
SELECT DISTINCT ON (devicename, objectid) *
FROM   tbl
ORDER  BY devicename, objectid, ts DESC;

TRUNCATE tbl;
ALTER TABLE tbl ADD column tbl_id serial PRIMARY KEY;

-- or, if you can afford to drop & recreate:
-- DROP TABLE tbl;
-- CREATE TABLE tbl (
--   tbl_id serial PRIMARY KEY
-- , address text
-- , devicename text
-- , objectid text
-- , ts timestamp);

INSERT INTO tbl (address, devicename, objectid, ts)
SELECT address, devicename, objectid, ts
FROM   tmp_tbl;

COMMIT;

Do it all within a transaction to make sure you are not going to fail half way through.

This is fast as long as your setting for temp_buffers is big enough to hold the temporary table. Else the system will start swapping data to disk and performance takes a dive. You can set temp_buffers just for the current session like this:

SET temp_buffers = 1000MB;

So you don't waste RAM that you don't normally need for temp_buffers. Has to be set before the first use of any temporary objects in the session. More information in this related answer.

Also, as the INSERT follows a TRUNCATE inside a transaction, it will be easy on the Write Ahead Log - improving performance.

Consider CREATE TABLE AS for the alternative route:

What causes large INSERT to slow down and disk usage to explode?

The only downside: You need an exclusive lock on the table. This may be a problem in databases with heavy concurrent load.

Finally, never use timestamp as column name. It's a reserved word in every SQL standard and a type name in PostgreSQL. I used ts instead.

a_horse_with_no_name · Answer

This should do it:

delete from devices
using (
   select ctid as cid, 
          row_number() over (partition by devicename, objectid order by timestamp asc) as rn
   from devices
) newest
where newest.cid = devices.ctid
and newest.rn <> 1;

It creates a derived table that will assign unique numbers to each combination of (address, devicename, objectid) giving the earliest one (the one with the smallest timestamp value) the number 1. Then this result is used to delete all those that do not have the number 1. The virtual column ctid is used to uniquely identify those rows (it's an internal identifier supplied by Postgres).

Note that for deleting a really large amount of rows, Erwin's approach will most definitely be faster.

SQLFiddle demo: http://www.sqlfiddle.com/#!1/5d9fe/2

PostgreSQL delete all but the oldest records

Tags:

sql

postgresql

duplicate-removal

dars33

2 Answers

Erwin Brandstetter

a_horse_with_no_name

Recent Activity

Donate For Us

PostgreSQL delete all but the oldest records

Tags:

sql

postgresql

duplicate-removal

dars33

2 Answers

Erwin Brandstetter

a_horse_with_no_name

Related questions

Recent Activity

Donate For Us