In our system we run hourly imports from an external database. Due to an error in the import scripts, there are now some duplicate records.
A duplicate is defined as any record that has the same :legacy_id and :company as another record.
What code can I run to find and delete these duplicates?
I was playing around with this:
Product.select(:legacy_id, :company).group(:legacy_id, :company).having("count(*) > 1")
It seemed to return some of the duplicates, but I wasn't sure how to delete them from there.
Any ideas?
You can try the following approach, which keeps the row with the lowest id in each (legacy_id, company) group and deletes everything else:
Product.where.not(
  id: Product.group(:legacy_id, :company).pluck(Arel.sql('MIN(products.id)'))
).delete_all
(On Rails 5.2 and later, raw SQL passed to pluck has to be wrapped in Arel.sql; on older versions the plain string 'min(products.id)' works as-is.)
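If you need ActiveRecord callbacks or dependent associations to run, or you'd rather keep the newest record in each group instead of the oldest, a variant like this should work (a sketch, assuming ids increase with insertion order; destroy is much slower than delete_all because it loads and removes records one by one):
keep_ids = Product.group(:legacy_id, :company).pluck(Arel.sql('MAX(products.id)'))
Product.where.not(id: keep_ids).find_each(&:destroy)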
Or in pure SQL:
delete from products
where id not in (
  select min(p.id) from products p
  group by p.legacy_id, p.company
);
(Note that MySQL will not let you delete from a table that is also referenced in the subquery; there you would need to wrap the subquery in a derived table first.)
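Finally, to build on the group/having relation from the question: you can also resolve duplicates group by group. This is slower, since it issues one query per duplicated pair, but it is easy to reason about (again a sketch; it keeps the lowest id in each group):
# Find each (legacy_id, company) pair that appears more than once,
# then delete all but the first (lowest-id) record in that group.
Product.group(:legacy_id, :company)
       .having('COUNT(*) > 1')
       .pluck(:legacy_id, :company)
       .each do |legacy_id, company|
  ids = Product.where(legacy_id: legacy_id, company: company).order(:id).pluck(:id)
  Product.where(id: ids.drop(1)).delete_all # keep ids.first
end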