Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to delete duplicate rows without unique identifier

I have duplicate rows in my table and I want to delete duplicates in the most efficient way since the table is big. After some research, I have come up with this query:

WITH TempEmp AS ( SELECT name, ROW_NUMBER() OVER(PARTITION by name, address, zipcode ORDER BY name) AS duplicateRecCount FROM mytable ) -- Now Delete Duplicate Records DELETE FROM TempEmp WHERE duplicateRecCount > 1; 

But it only works in SQL, not in Netezza. It would seem that it does not like the DELETE after the WITH clause?

like image 410
moe Avatar asked Nov 05 '14 23:11

moe


People also ask

How can delete duplicate rows using row ID?

Deleting Multiple Duplicates. Select the RowID you want to delete. After "SQL," enter "select rowid, name from names;." Delete the duplicate.

How do I eliminate duplicate rows?

Use DataFrame. drop_duplicates() to Drop Duplicate and Keep First Rows. You can use DataFrame. drop_duplicates() without any arguments to drop rows with the same values on all columns.

How to delete duplicate rows in SQL Server?

DELETE FROM tablename WHERE id IN (SELECT id FROM (SELECT id, ROW_NUMBER () OVER (partition BY column1, column2, column3 ORDER BY id) AS rnum FROM tablename) t WHERE t.rnum > 1); Sometimes a timestamp field is used instead of an ID field. For smaller tables, we can use rowid pseudo column to delete duplicate rows.

How to remove duplicates from list of cells except 1 in Excel?

1. Select the list of data you want to remove duplicates from, and click Kutools > Select > Select Duplicate & Unique Cells. See screenshot: 2. In the Select Duplicate & Unique Cells dialog, check Duplicates (Except 1st one) option in the Rule section.

How to remove [duplicatecount] greater than 1 in SQL?

It removes the rows having the value of [DuplicateCount] greater than 1 We can use the SQL RANK function to remove the duplicate rows as well. SQL RANK function gives unique row ID for each row irrespective of the duplicate row. In the following query, we use a RANK function with the PARTITION BY clause.

How to delete duplicate columns in SSIs?

Add a Sort operator from the SSIS toolbox for SQL delete operation and join it with the source data For the configuration of the Sort operator, double click on it and select the columns that contain duplicate values. In our case, duplicate value is in [FirstName], [LastName], [Country] columns.


2 Answers

I like @erwin-brandstetter 's solution, but wanted to show a solution with the USING keyword:

DELETE   FROM table_with_dups T1   USING       table_with_dups T2 WHERE  T1.ctid    < T2.ctid       -- delete the "older" ones   AND  T1.name    = T2.name       -- list columns that define duplicates   AND  T1.address = T2.address   AND  T1.zipcode = T2.zipcode; 

If you want to review the records before deleting them, then simply replace DELETE with SELECT * and USING with a comma ,, i.e.

SELECT * FROM table_with_dups T1   ,           table_with_dups T2 WHERE  T1.ctid    < T2.ctid       -- select the "older" ones   AND  T1.name    = T2.name       -- list columns that define duplicates   AND  T1.address = T2.address   AND  T1.zipcode = T2.zipcode; 

Update: I tested some of the different solutions here for speed. If you don't expect many duplicates, then this solution performs much better than the ones that have a NOT IN (...) clause as those generate a lot of rows in the subquery.

If you rewrite the query to use IN (...) then it performs similarly to the solution presented here, but the SQL code becomes much less concise.

Update 2: If you have NULL values in one of the key columns (which you really shouldn't IMO), then you can use COALESCE() in the condition for that column, e.g.

  AND COALESCE(T1.col_with_nulls, '[NULL]') = COALESCE(T2.col_with_nulls, '[NULL]') 
like image 197
isapir Avatar answered Sep 27 '22 23:09

isapir


If you have no other unique identifier, you can use ctid:

delete from mytable     where exists (select 1                   from mytable t2                   where t2.name = mytable.name and                         t2.address = mytable.address and                         t2.zip = mytable.zip and                         t2.ctid > mytable.ctid                  ); 

It is a good idea to have a unique, auto-incrementing id in every table. Doing a delete like this is one important reason why.

like image 21
Gordon Linoff Avatar answered Sep 27 '22 21:09

Gordon Linoff