
Remove duplicate records based on multiple columns?

I'm using Heroku to host my Ruby on Rails application and for one reason or another, I may have some duplicate rows.

Is there a way to delete duplicate records based on 2 or more criteria but keep just 1 record of that duplicate collection?

In my use case, I have a Make and Model relationship for cars in my database.

Make      Model
---       ---
Name      Name
          Year
          Trim
          MakeId

I'd like to delete all Model records that have the same Name, Year and Trim but keep 1 of those records (meaning, I need the record but only once). I'm using Heroku console so I can run some active record queries easily.

Any suggestions?

sergserg asked Jan 02 '13 15:01




2 Answers

    class Model

      def self.dedupe
        # find all models and group them on keys which should be common
        grouped = all.group_by { |model| [model.name, model.year, model.trim, model.make_id] }
        grouped.values.each do |duplicates|
          # the first one we want to keep right?
          first_one = duplicates.shift # or pop for last one
          # if there are any more left, they are duplicates
          # so delete all of them
          duplicates.each { |double| double.destroy } # duplicates can now be destroyed
        end
      end

    end

    Model.dedupe
  • Load all records
  • Group them by the columns that should be unique together
  • Loop over the values of the resulting hash (each value is an array of records sharing the same key)
  • Remove the first record from each array, because you want to retain one copy
  • Destroy the rest
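The steps above can be tried without a database. The following is a minimal plain-Ruby sketch, not the answer's ActiveRecord code: `CarModel` is a hypothetical Struct standing in for the model, `split_duplicates` is an illustrative helper name, and the sample rows are invented. It only reports which records would be kept and which would be destroyed.

```ruby
# Hypothetical stand-in for the ActiveRecord model (assumption, not from the question).
CarModel = Struct.new(:id, :name, :year, :trim, :make_id)

# Returns [kept, removed]: the first record of each group is kept,
# every later record with the same key is treated as a duplicate.
def split_duplicates(records)
  kept = []
  removed = []
  groups = records.group_by { |r| [r.name, r.year, r.trim, r.make_id] }
  groups.each_value do |group|
    first, *rest = group
    kept << first
    removed.concat(rest)
  end
  [kept, removed]
end

# Invented sample data: ids 1, 2 and 4 share the same name, year, trim and make.
records = [
  CarModel.new(1, "Civic", 2012, "LX", 10),
  CarModel.new(2, "Civic", 2012, "LX", 10),
  CarModel.new(3, "Civic", 2012, "EX", 10),
  CarModel.new(4, "Civic", 2012, "LX", 10)
]

kept, removed = split_duplicates(records)
puts kept.map(&:id).inspect    # [1, 3]
puts removed.map(&:id).inspect # [2, 4]
```

Like the answer's version, this loads every record into memory, so it suits a one-off cleanup on a table of modest size rather than a very large one.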
Aditya Sanghi answered Sep 20 '22 04:09


Suppose your User table contains the following data:

    User.all
    => [
      #<User id: 15, name: "a", email: "[email protected]", created_at: "2013-08-06 08:57:09", updated_at: "2013-08-06 08:57:09">,
      #<User id: 16, name: "a1", email: "[email protected]", created_at: "2013-08-06 08:57:20", updated_at: "2013-08-06 08:57:20">,
      #<User id: 17, name: "b", email: "[email protected]", created_at: "2013-08-06 08:57:28", updated_at: "2013-08-06 08:57:28">,
      #<User id: 18, name: "b1", email: "[email protected]", created_at: "2013-08-06 08:57:35", updated_at: "2013-08-06 08:57:35">,
      #<User id: 19, name: "b11", email: "[email protected]", created_at: "2013-08-06 09:01:30", updated_at: "2013-08-06 09:01:30">,
      #<User id: 20, name: "b11", email: "[email protected]", created_at: "2013-08-06 09:07:58", updated_at: "2013-08-06 09:07:58">]

Some email addresses are duplicated, so the aim is to remove the duplicate rows from the users table and keep one row per address.

Step 1:

Get the id of one record to keep from each group of rows that share the same email and name (here, the smallest id in each group).

    ids = User.select("MIN(id) as id").group(:email,:name).collect(&:id)
    => [15, 16, 18, 19, 17]

Step 2:

Remove the duplicates: every record whose id is not in the list from Step 1 is a duplicate.

The ids array now holds the following ids:

    [15, 16, 18, 19, 17]

    User.where("id NOT IN (?)",ids)  # To get all duplicate records
    User.where("id NOT IN (?)",ids).destroy_all
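To see what the two steps compute, here is a minimal in-memory sketch in plain Ruby rather than ActiveRecord. The rows, names and addresses are invented for illustration, and `keep_ids` and `duplicate_ids` are hypothetical variable names. `group_by` plus `min` plays the role of `MIN(id) ... GROUP BY email, name`, and array difference plays the role of `id NOT IN (...)`.

```ruby
# Invented sample rows (assumption: not the data shown above).
rows = [
  { id: 1, name: "ann", email: "ann@example.com" },
  { id: 2, name: "bob", email: "bob@example.com" },
  { id: 3, name: "ann", email: "ann@example.com" },
  { id: 4, name: "bob", email: "bob@example.com" },
  { id: 5, name: "bob", email: "robert@example.com" }
]

# Step 1: the smallest id in each (email, name) group is the record to keep.
keep_ids = rows.group_by { |r| [r[:email], r[:name]] }
               .map { |_key, group| group.map { |r| r[:id] }.min }

# Step 2: every id that is not in keep_ids belongs to a duplicate row.
duplicate_ids = rows.map { |r| r[:id] } - keep_ids

puts keep_ids.inspect      # [1, 2, 5]
puts duplicate_ids.inspect # [3, 4]
```

Note that id 5 is kept even though its name matches another row, because the grouping uses email and name together.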

Rails 4:

Active Record 4 introduces where.not, which lets you write Step 2 as:

User.where.not(id: ids).destroy_all 
Aravind encore answered Sep 20 '22 04:09