I have duplicate rows in my table and I want to delete duplicates in the most efficient way since the table is big. After some research, I have come up with this query:
WITH TempEmp AS ( SELECT name, ROW_NUMBER() OVER(PARTITION by name, address, zipcode ORDER BY name) AS duplicateRecCount FROM mytable ) -- Now Delete Duplicate Records DELETE FROM TempEmp WHERE duplicateRecCount > 1;
But it only works in SQL, not in Netezza. It would seem that it does not like the DELETE
after the WITH
clause?
Deleting Multiple Duplicates. Select the RowID you want to delete. After "SQL," enter "select rowid, name from names;." Delete the duplicate.
Use DataFrame. drop_duplicates() to Drop Duplicate and Keep First Rows. You can use DataFrame. drop_duplicates() without any arguments to drop rows with the same values on all columns.
DELETE FROM tablename WHERE id IN (SELECT id FROM (SELECT id, ROW_NUMBER () OVER (partition BY column1, column2, column3 ORDER BY id) AS rnum FROM tablename) t WHERE t.rnum > 1); Sometimes a timestamp field is used instead of an ID field. For smaller tables, we can use rowid pseudo column to delete duplicate rows.
1. Select the list of data you want to remove duplicates from, and click Kutools > Select > Select Duplicate & Unique Cells. See screenshot: 2. In the Select Duplicate & Unique Cells dialog, check Duplicates (Except 1st one) option in the Rule section.
It removes the rows having the value of [DuplicateCount] greater than 1 We can use the SQL RANK function to remove the duplicate rows as well. SQL RANK function gives unique row ID for each row irrespective of the duplicate row. In the following query, we use a RANK function with the PARTITION BY clause.
Add a Sort operator from the SSIS toolbox for SQL delete operation and join it with the source data For the configuration of the Sort operator, double click on it and select the columns that contain duplicate values. In our case, duplicate value is in [FirstName], [LastName], [Country] columns.
I like @erwin-brandstetter 's solution, but wanted to show a solution with the USING
keyword:
DELETE FROM table_with_dups T1 USING table_with_dups T2 WHERE T1.ctid < T2.ctid -- delete the "older" ones AND T1.name = T2.name -- list columns that define duplicates AND T1.address = T2.address AND T1.zipcode = T2.zipcode;
If you want to review the records before deleting them, then simply replace DELETE
with SELECT *
and USING
with a comma ,
, i.e.
SELECT * FROM table_with_dups T1 , table_with_dups T2 WHERE T1.ctid < T2.ctid -- select the "older" ones AND T1.name = T2.name -- list columns that define duplicates AND T1.address = T2.address AND T1.zipcode = T2.zipcode;
Update: I tested some of the different solutions here for speed. If you don't expect many duplicates, then this solution performs much better than the ones that have a NOT IN (...)
clause as those generate a lot of rows in the subquery.
If you rewrite the query to use IN (...)
then it performs similarly to the solution presented here, but the SQL code becomes much less concise.
Update 2: If you have NULL
values in one of the key columns (which you really shouldn't IMO), then you can use COALESCE()
in the condition for that column, e.g.
AND COALESCE(T1.col_with_nulls, '[NULL]') = COALESCE(T2.col_with_nulls, '[NULL]')
If you have no other unique identifier, you can use ctid
:
delete from mytable where exists (select 1 from mytable t2 where t2.name = mytable.name and t2.address = mytable.address and t2.zip = mytable.zip and t2.ctid > mytable.ctid );
It is a good idea to have a unique, auto-incrementing id in every table. Doing a delete
like this is one important reason why.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With