Django select only rows with duplicate field values

Suppose we have a model in Django defined as follows:

class Literal(models.Model):
    name = models.CharField(...)
    ...

The name field is not unique, so it can hold duplicate values. I need to accomplish the following task: select all rows from the model whose name value also appears in at least one other row.

I know how to do it using plain SQL (maybe not the best solution):

SELECT * FROM literal WHERE name IN (
    SELECT name FROM literal GROUP BY name HAVING COUNT(name) > 1
);
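For anyone who wants to check what the SQL above returns, here is a minimal sketch using Python's built-in sqlite3 module. The sample rows are made up for illustration; only the table name and the name column come from the question.

```python
import sqlite3

# Hypothetical sample data; table and column names follow the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE literal (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO literal (name) VALUES (?)",
    [("alpha",), ("beta",), ("alpha",), ("gamma",), ("beta",)],
)

# Same query as above: keep every row whose name occurs more than once.
rows = conn.execute(
    """
    SELECT id, name FROM literal
    WHERE name IN (
        SELECT name FROM literal GROUP BY name HAVING COUNT(name) > 1
    )
    ORDER BY id
    """
).fetchall()

print(rows)
```

With this data, the "gamma" row is the only one left out, since its name occurs once.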

So, is it possible to select this using the Django ORM? Or is there a better SQL solution?

dragoon asked Jan 24 '12 15:01


3 Answers

Try:

from django.db.models import Count

(
    Literal.objects.values('name')
    .annotate(Count('id'))
    .order_by()
    .filter(id__count__gt=1)
)

This is as close as you can get with Django. The problem is that this will return a ValuesQuerySet with only name and count. However, you can then use this to construct a regular QuerySet by feeding it back into another query:

dupes = (
    Literal.objects.values('name')
    .annotate(Count('id'))
    .order_by()
    .filter(id__count__gt=1)
)
Literal.objects.filter(name__in=[item['name'] for item in dupes])
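To make the two steps concrete without a Django project, here is a rough sketch of the same idea in plain SQL through Python's sqlite3 module. It is not the SQL Django actually emits; the sample rows are hypothetical, and the two statements are only an assumed approximation of what the grouped query and the follow-up filter do.

```python
import sqlite3

# Hypothetical sample data standing in for the Literal table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE literal (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO literal (name) VALUES (?)",
    [("alpha",), ("beta",), ("alpha",), ("gamma",)],
)

# Step 1: the grouped query yields only (name, count) pairs, not full rows.
dupes = conn.execute(
    "SELECT name, COUNT(id) FROM literal GROUP BY name HAVING COUNT(id) > 1"
).fetchall()

# Step 2: feed the duplicate names back into a second query for the full rows.
names = [name for name, _count in dupes]
placeholders = ", ".join("?" for _ in names)
full_rows = conn.execute(
    f"SELECT id, name FROM literal WHERE name IN ({placeholders}) ORDER BY id",
    names,
).fetchall()

print(dupes)
print(full_rows)
```

The list of names is built in Python between the two queries, so this takes two round trips to the database.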
Chris Pratt answered Oct 17 '22 21:10


This was rejected as an edit, so here it is as a better answer:

from django.db.models import Count

dups = (
    Literal.objects.values('name')
    .annotate(count=Count('id'))
    .values('name')
    .order_by()
    .filter(count__gt=1)
)

This will return a ValuesQuerySet with all of the duplicate names. However, you can then use this to construct a regular QuerySet by feeding it back into another query. The Django ORM is smart enough to combine these into a single query:

Literal.objects.filter(name__in=dups) 

The extra call to .values('name') after the annotate call looks a little strange, but without it the subquery fails. The extra values() call tricks the ORM into selecting only the name column for the subquery.

Piper Merriam answered Oct 17 '22 22:10


Try using aggregation:

from django.db.models import Count

Literal.objects.values('name').annotate(name_count=Count('name')).exclude(name_count=1)
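One thing that may be worth checking with this variant: excluding a count of 1 is not quite the same condition as requiring a count greater than 1. The sketch below shows the difference at the SQL level with Python's sqlite3 module. It assumes the name column allows NULL (the question does not say) and that the exclude() call ends up as a HAVING NOT (...) condition; the sample rows are made up.

```python
import sqlite3

# Hypothetical sample data, including one row with a NULL name (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE literal (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO literal (name) VALUES (?)",
    [("alpha",), ("alpha",), ("beta",), (None,)],
)

# "Count is not 1": COUNT(name) skips NULLs, so the NULL group has count 0
# and passes this filter.
not_one = conn.execute(
    "SELECT name, COUNT(name) FROM literal "
    "GROUP BY name HAVING NOT (COUNT(name) = 1) ORDER BY name"
).fetchall()

# "Count is greater than 1": only genuine duplicates pass.
more_than_one = conn.execute(
    "SELECT name, COUNT(name) FROM literal "
    "GROUP BY name HAVING COUNT(name) > 1 ORDER BY name"
).fetchall()

print(not_one)
print(more_than_one)
```

If the field can never be NULL, the two conditions select the same groups. Either way, this returns names and counts rather than full rows.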
JamesO answered Oct 17 '22 20:10