Find duplicate records in large table on multiple columns the right way

Question

I have read many threads on this subject now and tried a few things but it has not worked as I hoped. I need some clarification and apologize if this is considered a duplicate thread.

A client of mine hosts a Postgres database where one table holds a little more then 12 million records. They have tasked me with finding duplicate records, extract them for viewing and if everything looks ok, delete the duplicates.

My main concern has been performance on the server. Running DISTINCT queries on 12 million records must consume a lot of resources?

Since my first task is to extract the records for viewing in, say a CSV, and not simply deleting them my approach in PgAdmin was executing this to a file.

SELECT * 
FROM
    my_table
WHERE
my_table_id NOT IN (

                SELECT DISTINCT 
                    ON (
                        num_1,
                        num_2,
                        num_3,
                        num_4,
                        num_5,
                        my_date
                    )
                    my_table_id
                FROM
                    my_table
);

However this query takes way to long. After 20 minutes of execution time I halted the execution. To make things more complex my client is reluctant to allow me to clone a local copy of the table because of strict security. They prefer it is all done on the live hosting environment.

The table definition is quite simple. It looks like this

CREATE TABLE my_table
(
    my_table_id bigserial NOT NULL,
    num_1 bigserial NOT NULL,
    num_2 bigserial NOT NULL,
    num_3 bigserial NOT NULL,
    num_4 numeric,
    num_5 integer,
    my_date date,
    my_text character varying
)

The primary key "my_table_id" has not been compromised and is always unique. The col "my_text" is not interesting in the query since it will be empty for all duplicates. It is only the numeric fields and the date that needs matching. All columns (except my_table_id and my_text) must match across records to qualify as a duplicate.

What is the best way to solve this? Is there a server-friendly way that won´t eat all resources on the host environment? Please help me understand the best approach!

Thanks you!

radar · Accepted Answer

Need to use GROUP BY and HAVING to get duplicate records instead of DISTINCT

subquery will find all duplicate records

SELECT * FROM
my_table mt
JOIN
(
     SELECT
            num_1,
            num_2,
            num_3,
            num_4,
            num_5,
            my_date
     FROM
            my_table
     GROUP BY num_1, num_2, num_3, num_4, num_5, my_date
     HAVING COUNT(*) >1
) T 
ON mt.num_1= T.num_1
and mt.num_2= T.num_2
and mt.num_3= T.num_3
and mt.num_4= T.num_4
and mt.num_5= T.num_5
and mt.my_date= T.my_date

FuzzyTree · Answer

Another way using analytic functions

select * from (
    select * , 
    count(*) over (partition by num1,num2,num3,num4,my_date) cnt
    from mytable
) t1 where cnt > 1

Find duplicate records in large table on multiple columns the right way

Tags:

sql

postgresql

postgresql-8.4

Kalle

2 Answers

radar

FuzzyTree

Recent Activity

Donate For Us

Find duplicate records in large table on multiple columns the right way

Tags:

sql

postgresql

postgresql-8.4

Kalle

2 Answers

radar

FuzzyTree

Related questions

Recent Activity

Donate For Us