Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Delete duplicate records based on multiple columns

In our system we run hourly imports from an external database. Due to an error in the import scripts, there are now some duplicate records.

A duplicate is deemed where any record has the same :legacy_id and :company.

What code can I run to find and delete these duplicates?

I was playing around with this:

Product.select(:legacy_id,:company).group(:legacy_id,:company).having("count(*) > 1")

It seemed to return some of the duplicates, but I wasn't sure how to delete from there?

Any ideas?

like image 550
bnussey Avatar asked Nov 28 '14 13:11

bnussey


People also ask

How do I remove duplicate rows based on multiple columns in SQL?

In SQL, some rows contain duplicate entries in multiple columns(>1). For deleting such rows, we need to use the DELETE keyword along with self-joining the table with itself.

How do I delete duplicate rows based on multiple columns in pandas?

By using pandas. DataFrame. drop_duplicates() method you can drop/remove/delete duplicate rows from DataFrame. Using this method you can drop duplicate rows on selected multiple columns or all columns.


1 Answers

You can try the following approach:

Product.where.not(
  id: Product.group(:legacy_id, :company).pluck('min(products.id)')
).delete_all

Or pure sql:

delete from products
where id not in ( 
   select min(p.id) from products p group by p.legacy_id, p.company
)
like image 91
potashin Avatar answered Oct 13 '22 12:10

potashin