Remove duplicate SQL rows by looking at all columns

Question

I have this table, where every column is a VARCHAR (or equivalent):

field001 field002 field003 field004 field005 .... field500

500 VARCHAR columns. No primary keys. And no column is guaranteed to be unique. So the only way to know for sure if two rows are the same is to compare the values of all columns.

(Yes, this should be in TheDailyWTF. No, it's not my fault. Bear with me here).

I inserted a duplicate set of rows by mistake, and I need to find them and remove them.

There are 12 million rows in this table, so I'd rather not recreate it.

However, I do know which rows were mistakenly inserted (I have the .sql file).

So I figured I'd create another table and load it with those. And then I'd do some sort of join that would compare all columns on both tables and then delete the rows that are equal from the first table. I tried a NATURAL JOIN as that looked promising, but nothing was returned.

What are my options?

I'm using Amazon Redshift (which is based on PostgreSQL 8.0.2, if I recall correctly), but I think this is a general SQL question.

Admin · Accepted Answer

You can treat the whole row as a single record in Postgres (and thus I think in Redshift).

The following works in Postgres and keeps one row from each set of duplicates:

delete from the_table
where ctid not in (select min(ctid)
                   from the_table
                   group by the_table); --<< Yes, the group by is correct!

This is going to be slow!

Grouping over so many columns and then deleting with a NOT IN will take quite some time, especially if a lot of rows are going to be deleted.
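Before running the statement against 12 million rows, it may be worth trying it on a throwaway table. The sketch below assumes a stock PostgreSQL server; demo_table and its three columns are hypothetical stand-ins for the real 500-column table, and Redshift may not expose ctid at all, so treat it as an illustration of the idea rather than something verified on Redshift.

```sql
-- Hypothetical 3-column stand-in for the 500-column table.
create table demo_table (
    field001 varchar,
    field002 varchar,
    field003 varchar
);

insert into demo_table values
    ('a', 'b', 'c'),
    ('a', 'b', 'c'),    -- accidental duplicate
    ('x', 'y', null),
    ('x', 'y', null),   -- accidental duplicate that contains a NULL
    ('p', 'q', 'r');    -- unique row, must survive

-- Group by the whole row and keep only the physical row with the
-- smallest ctid in each group of identical rows.
delete from demo_table
where ctid not in (select min(ctid)
                   from demo_table
                   group by demo_table);

-- Inspect what is left: each distinct combination should appear once.
select * from demo_table order by field001;
```

The NULL row is included on purpose: GROUP BY places NULLs in the same group, whereas a plain equality join on each column does not match NULL to NULL.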

If you want to delete all duplicate rows (not keeping any of them), you can use the following:

delete from the_table
where the_table in (select the_table
                    from the_table
                    group by the_table
                    having count(*) > 1);
