Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find rows with multiple duplicate fields with Active Record, Rails & Postgres

People also ask

How do you find duplicate rows in pandas based on multiple columns?

Find Duplicate Rows based on all columns To find & select the duplicate all rows based on all columns call the Daraframe. duplicate() without any subset argument. It will return a Boolean series with True at the place of each duplicated rows except their first occurrence (default value of keep argument is 'first').

How do I find duplicate rows in a column in R?

We can find the rows with duplicated values in a particular column of an R data frame by using duplicated function inside the subset function. This will return only the duplicate rows based on the column we choose that means the first unique value will not be in the output.


Tested & Working Version

User.select(:first,:email).group(:first,:email).having("count(*) > 1")

Also, this is a little unrelated but handy. If you want to see how times each combination was found, put .size at the end:

User.select(:first,:email).group(:first,:email).having("count(*) > 1").size

and you'll get a result set back that looks like this:

{[nil, nil]=>512,
 ["Joe", "[email protected]"]=>23,
 ["Jim", "[email protected]"]=>36,
 ["John", "[email protected]"]=>21}

Thought that was pretty cool and hadn't seen it before.

Credit to Taryn, this is just a tweaked version of her answer.


That error occurs because POSTGRES requires you to put grouping columns in the SELECT clause.

try:

User.select(:first,:email).group(:first,:email).having("count(*) > 1").all

(note: not tested, you may need to tweak it)

EDITED to remove id column


If you need the full models, try the following (based on @newUserNameHere's answer).

User.where(email: User.select(:email).group(:email).having("count(*) > 1").select(:email))

This will return the rows where the email address of the row is not unique.

I'm not aware of a way to do this over multiple attributes.


Get all duplicates with a single query if you use PostgreSQL:

def duplicated_users
  duplicated_ids = User
    .group(:first, :email)
    .having("COUNT(*) > 1")
    .select('unnest((array_agg("id"))[2:])')

  User.where(id: duplicated_ids)
end

irb> duplicated_users