How do I delete duplicates rows in Postgres 9 table, the rows are completely duplicates on every field AND there is no individual field that could be used as a unique key so I cant just GROUP BY
columns and use a NOT IN
statement.
I'm looking for a single SQL statement, not a solution that requires me to create temporary table and insert records into that. I know how to do that but requires more work to fit into my automated process.
Table definition:
jthinksearch=> \d releases_labels;
Unlogged table "discogs.releases_labels"
Column | Type | Modifiers
------------+---------+-----------
label | text |
release_id | integer |
catno | text |
Indexes:
"releases_labels_catno_idx" btree (catno)
"releases_labels_name_idx" btree (label)
Foreign-key constraints:
"foreign_did" FOREIGN KEY (release_id) REFERENCES release(id)
Sample data:
jthinksearch=> select * from releases_labels where release_id=6155;
label | release_id | catno
--------------+------------+------------
Warp Records | 6155 | WAP 39 CDR
Warp Records | 6155 | WAP 39 CDR
RANK function to SQL delete duplicate rows We can use the SQL RANK function to remove the duplicate rows as well. SQL RANK function gives unique row ID for each row irrespective of the duplicate row. In the following query, we use a RANK function with the PARTITION BY clause.
So to delete the duplicate record with SQL Server we can use the SET ROWCOUNT command to limit the number of rows affected by a query. By setting it to 1 we can just delete one of these rows in the table. Note: the select commands are just used to show the data prior and after the delete occurs.
If a table has duplicate rows, we can delete it by using the DELETE statement. In the case, we have a column, which is not the part of group used to evaluate the duplicate records in the table.
You could do this without a problem by using a common table expression (CTE), you don't need to use any temporary tables at all. Just be careful if the delete is going against a high traffic table. Deleting large amounts of data can cause locking and blocking, also the tran log will be hit.
If you can afford to rewrite the whole table, this is probably the simplest approach:
WITH Deleted AS (
DELETE FROM discogs.releases_labels
RETURNING *
)
INSERT INTO discogs.releases_labels
SELECT DISTINCT * FROM Deleted
If you need to specifically target the duplicated records, you can make use of the internal ctid
field, which uniquely identifies a row:
DELETE FROM discogs.releases_labels
WHERE ctid NOT IN (
SELECT MIN(ctid)
FROM discogs.releases_labels
GROUP BY label, release_id, catno
)
Be very careful with ctid
; it changes over time. But you can rely on it staying the same within the scope of a single statement.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With