I've seen a couple of solutions for this, but I'm wondering what the best and most efficient way is to de-dupe a table. You can use code (SQL, etc.) to illustrate your point, but I'm just looking for basic algorithms. I assumed there would already be a question about this on SO, but I wasn't able to find one, so if it already exists just give me a heads up.
(Just to clarify - I'm referring to getting rid of duplicates in a table that has an incremental automatic PK and has some rows that are duplicates in everything but the PK field.)
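We can use a Common Table Expression (CTE) to remove duplicate rows in SQL Server. CTEs are available starting with SQL Server 2005. Inside the CTE, the ROW_NUMBER function assigns a sequential number to each row within a partition; if you partition by every column except the PK, each duplicate after the first gets a number greater than 1 and can be deleted.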
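As a rough sketch of the ROW_NUMBER idea, here is a version that runs against an in-memory SQLite database from Python (it assumes the bundled SQLite is 3.25 or newer, which is when window functions were added). The table foo and its columns id, name, and email are made up for illustration. SQLite cannot delete through the CTE itself, so the sketch deletes by id; in SQL Server you could instead delete from the CTE directly where the row number is greater than 1.

```python
import sqlite3


def dedupe_with_row_number(conn):
    """Keep the lowest-id row of each duplicate group and delete the rest."""
    conn.execute(
        """
        WITH numbered AS (
            SELECT id,
                   ROW_NUMBER() OVER (
                       PARTITION BY name, email  -- every column except the PK
                       ORDER BY id               -- the first row seen is the keeper
                   ) AS rn
            FROM foo
        )
        DELETE FROM foo
        WHERE id IN (SELECT id FROM numbered WHERE rn > 1)
        """
    )
    conn.commit()


# Hypothetical sample data: "ann" appears three times, "bob" twice.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE foo (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, email TEXT)"
)
conn.executemany(
    "INSERT INTO foo (name, email) VALUES (?, ?)",
    [("ann", "a@x"), ("bob", "b@x"), ("ann", "a@x"), ("ann", "a@x"), ("bob", "b@x")],
)

dedupe_with_row_number(conn)
rows = conn.execute("SELECT id, name, email FROM foo ORDER BY id").fetchall()
print(rows)
```

The surviving rows keep their original PK values, which matters if other tables reference them.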
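To delete duplicate records using ROWCOUNT in SQL Server, we can use the SET ROWCOUNT command to limit the number of rows affected by a query. With it set to 1, a DELETE that matches a group of duplicates removes just one of them, so it has to be repeated for groups with more than two copies. Run SET ROWCOUNT 0 afterwards to lift the limit. Microsoft's documentation discourages SET ROWCOUNT for DELETE, INSERT, and UPDATE in new code and recommends the TOP clause instead.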
Another approach is to run SELECT DISTINCT <insert all columns but the PK here> FROM foo. Create a temp table using that query (the syntax varies by RDBMS, but there's typically a SELECT … INTO or CREATE TABLE AS pattern available), then blow away the old table and pump the data from the temp table back into it.
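The temp-table approach could look something like this in SQLite, again driven from Python. The table foo, its columns, and the temp table name foo_distinct are assumptions for the example, and "blow away" is interpreted here as deleting all rows rather than dropping the table.

```python
import sqlite3


def dedupe_via_temp_table(conn):
    """Copy the distinct non-PK rows aside, empty the table, and reload it."""
    cur = conn.cursor()
    # 1. Materialize the distinct rows, leaving out the PK column.
    cur.execute(
        "CREATE TEMP TABLE foo_distinct AS SELECT DISTINCT name, email FROM foo"
    )
    # 2. Empty the original table (a DROP plus re-CREATE would also work).
    cur.execute("DELETE FROM foo")
    # 3. Reload; the auto-increment PK hands out new ids.
    cur.execute("INSERT INTO foo (name, email) SELECT name, email FROM foo_distinct")
    cur.execute("DROP TABLE foo_distinct")
    conn.commit()


# Hypothetical sample data: "ann" appears three times, "bob" twice.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE foo (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, email TEXT)"
)
conn.executemany(
    "INSERT INTO foo (name, email) VALUES (?, ?)",
    [("ann", "a@x"), ("bob", "b@x"), ("ann", "a@x"), ("ann", "a@x"), ("bob", "b@x")],
)

dedupe_via_temp_table(conn)
rows = conn.execute("SELECT name, email FROM foo ORDER BY name").fetchall()
print(rows)
```

Because the reloaded rows get fresh PK values, this variant only fits when nothing else references the old ids.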