Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Return duplicate records (activerecord, postgres)

I have the following query returning duplicate titles, but :id is nil:

Movie.select(:title).group(:title).having("count(*) > 1")

[#<Movie:0x007f81f7111c20 id: nil, title: "Fargo">,
#<Movie:0x007f81f7111ab8 id: nil, title: "Children of Men">,
#<Movie:0x007f81f7111950 id: nil, title: "The Martian">,
#<Movie:0x007f81f71117e8 id: nil, title: "Gravity">]

I tried adding :id to the select and group but it returns an empty array. How can I return the whole movie record, not just the titles?

like image 920
Ashbury Avatar asked Oct 27 '15 06:10

Ashbury


1 Answers

A SQL-y Way

First, let's just solve the problem in SQL, so that the Rails-specific syntax doesn't trick us.

This SO question is a pretty clear parallel: Finding duplicate values in a SQL Table

The answer from KM (second from the top, non-checkmarked, at the moment) meets your criteria of returning all duplicated records along with their IDs. I've modified KM's SQL to match your table...

SELECT
  m.id, m.title
FROM 
  movies m
INNER JOIN (
  SELECT
    title, COUNT(*) AS CountOf
  FROM
    movies
  GROUP BY 
    title
  HAVING COUNT(*)>1
) dupes 
ON
  m.title=dupes.title

The portion inside the INNER JOIN ( ) is essentially what you've generated already. A grouped table of duplicated titles and counts. The trick is JOINing it to the unmodified movies table, which will exclude any movies that don't have matches in the query of dupes.

Why is this so hard to generate in Rails? The trickiest part is that, because we're JOINing movies to movies, we have to create table aliases (m and dupes in my query above).

Sadly, it Rails doesn't provide any clean ways of declaring these aliases. Some references:

  • Rails GitHub issues mentioning "join" and "alias". Misery.
  • SO Question: ActiveRecord query with alias'd table names

Fortunately, since we've got the SQL in-hand, we can use the .find_by_sql method...

Movie.find_by_sql("SELECT m.id, m.title FROM movies m INNER JOIN (SELECT title, COUNT(*) FROM movies GROUP BY title HAVING COUNT(*)>1) dupes ON m.first=.first")

Because we're calling Movie.find_by_sql, ActiveRecord assumes our hand-written SQL can be bundled into Movie objects. It doesn't massage or generate anything, which lets us do our aliases.

This approach has its shortcomings. It returns an array and not an ActiveRecord Relation, which means it can't be chained with other scopes. And, in the documentation for the find_by_sql method, we get extra discouragement...

This should be a last resort because using, for example, MySQL specific terms will lock you to using that particular database engine or require you to change your call if you switch engines.

A Rails-y Way

Really, what is the SQL doing above? It's getting a list of names that appear more than once. Then, it's matching that list against the original table. So, let's just do that using Rails.

titles_with_multiple = Movie.group(:title).having("count(title) > 1").count.keys

Movie.where(title: titles_with_multiple)

We call .keys because the first query returns an hash. The keys are our titles. The where() method can take an array, and we've handed it an array of titles. Winner.

You could argue one line of Ruby is more elegant than two. And if that one line of Ruby has an ungodly string of SQL embedded within it, how elegant is it really?

Hope this helps!

like image 91
Lanny Bose Avatar answered Oct 21 '22 22:10

Lanny Bose