Most efficient way to remove duplicates - Postgres

I have always deleted duplicates with this kind of query:

delete from test a
using test b 
where a.ctid < b.ctid
and a.col1=b.col1
and a.col2=b.col2
and a.col3=b.col3

Also, I have seen this query being used:

DELETE FROM test WHERE test.ctid NOT IN 
(SELECT ctid FROM (
    -- select ctid explicitly: * does not include system columns
    SELECT DISTINCT ON (col1, col2) ctid
  FROM test) s);

And even this one (repeated until you run out of duplicates):

delete from test ju where ju.ctid in 
(select ctid from (
-- select ctid explicitly: * does not include system columns
select distinct on (col1, col2) ctid from test ou
where (select count(*) from test inr
where inr.col1 = ou.col1 and inr.col2 = ou.col2) > 1
) s);

Now I have run into a table with 5 million rows, which has indexes on the columns that are going to be matched in the where clause. And now I wonder:

Which of these methods, which apparently all do the same thing, is the most efficient, and why? I just ran the second one and it has been taking over 45 minutes to remove the duplicates. I'm curious which would be the most efficient, in case I have to remove duplicates from another huge table. It doesn't matter whether the table has a primary key in the first place; one can always be created if that helps.

A.T. asked Dec 18 '22 19:12

1 Answer

demo:db<>fiddle

Finding duplicates can easily be achieved with the row_number() window function:

SELECT ctid 
FROM(
    SELECT 
        *, 
        ctid,
        row_number() OVER (PARTITION BY col1, col2, col3 ORDER BY ctid) 
    FROM test
)s
WHERE row_number >= 2

This groups identical rows, orders each group by ctid, and adds a row counter. Every row with row_number > 1 is therefore a duplicate and can be deleted:

DELETE 
FROM test
WHERE ctid IN 
(
    SELECT ctid 
    FROM(
        SELECT 
            *, 
            ctid,
            row_number() OVER (PARTITION BY col1, col2, col3 ORDER BY ctid) 
        FROM test
    )s
    WHERE row_number >= 2
)

I don't know if this solution is faster than your attempts, but you could give it a try.

Furthermore, as @a_horse_with_no_name already stated, I would recommend using your own identifier column instead of ctid, for performance reasons.
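
As a rough sketch of what that recommendation could look like: the column name id below is hypothetical (it is not part of the original table), and adding it rewrites the whole table, which may itself take a while on 5 million rows. The delete is the self-join form from the question, with the new column in place of ctid; it keeps the row with the lowest id in each group.

```sql
-- Assumption: "id" is a hypothetical surrogate key added only for this cleanup.
-- Adding a bigserial primary key rewrites the table and builds an index on id.
ALTER TABLE test ADD COLUMN id bigserial PRIMARY KEY;

-- Delete every row for which an identical row with a smaller id exists,
-- so exactly one row (the lowest id) survives per (col1, col2, col3) group.
DELETE FROM test a
USING test b
WHERE a.id > b.id
  AND a.col1 = b.col1
  AND a.col2 = b.col2
  AND a.col3 = b.col3;
```

Note that = never matches NULL, so rows with NULL in any of the compared columns are not treated as duplicates by this form.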


Edit:

For my test data, your first version seems to be slightly faster than my solution. Your second version seems to be slower, and your third version does not work for me (after fixing the syntax errors it shows no result).
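
One way to repeat this kind of comparison on your own table, sketched here with the first query from the question: EXPLAIN ANALYZE really executes the statement, so wrapping it in a transaction and rolling back leaves the data untouched while still reporting the plan and the actual run time. Timings will depend on your data, indexes and configuration.

```sql
BEGIN;

-- EXPLAIN ANALYZE runs the DELETE for real and reports plan and timing;
-- BUFFERS additionally shows how many pages were read or hit in cache.
EXPLAIN (ANALYZE, BUFFERS)
DELETE FROM test a
USING test b
WHERE a.ctid < b.ctid
  AND a.col1 = b.col1
  AND a.col2 = b.col2
  AND a.col3 = b.col3;

-- Undo the delete so the next candidate query sees the same rows.
ROLLBACK;
```

Swapping in each of the other DELETE statements between BEGIN and ROLLBACK gives comparable numbers for the same starting data.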

demo:db<>fiddle

S-Man answered Dec 28 '22 09:12