 

What is a good pythonic way of finding duplicate objects?

I frequently use sorted and groupby to find duplicate items in an iterable. Now I see that it is unreliable:

from itertools import groupby
data = 3 * ('x ', (1,), u'x')
duplicates = [k for k, g in groupby(sorted(data)) if len(list(g)) > 1]
print duplicates
# [] is printed - no duplicates found, as if all 9 values were unique

The reason why the code above fails in Python 2.x is explained here.
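
For reference, a minimal Python 2 sketch of the broken ordering: CPython 2 falls back to comparing type names for mismatched types (except that str and unicode compare by value), so the three values above form a comparison cycle and sorted() cannot group equal items together:

print u'x' < 'x '    # True - str/unicode compared by value: u'x' < u'x '
print 'x ' < (1,)    # True - compared by type name: 'str' < 'tuple'
print (1,) < u'x'    # True - compared by type name: 'tuple' < 'unicode'
# u'x' < 'x ' < (1,) < u'x' is a cycle, so no consistent sort order exists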

What is a reliable pythonic way of finding duplicates?

I looked for similar questions/answers on SO. The best of them is "In Python, how do I take a list and reduce it to a list of duplicates?", but the accepted solution there is not pythonic (it is a procedural multiline for ... if ... add ... else ... add ... return result) and the other solutions are either unreliable (they depend on the transitivity of the "<" operator, which mixed types do not guarantee) or slow (O(n*n)).
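
For context, a sketch along the lines of that accepted set-based solution (the helper name is mine; it assumes hashable items, and it is reliable, just verbose):

def list_duplicates(seq):
    # Track items already seen; anything encountered twice is a duplicate.
    seen = set()
    dupes = set()
    for item in seq:
        if item in seen:
            dupes.add(item)
        else:
            seen.add(item)
    return list(dupes)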

[EDIT] Closed. The accepted answer helped me to summarize more general conclusions in my answer below.

I like to use built-in types to represent e.g. tree structures. This is why I am now wary of mixing them.

asked Apr 20 '12 by hynekcer


1 Answer

Note: Assumes entries are hashable

>>> from collections import Counter
>>> data = 3 * ('x ', (1,), u'x')
>>> [k for k, c in Counter(data).iteritems() if c > 1]
[u'x', 'x ', (1,)]
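
For what it's worth, on Python 3 the same idea would use items() instead of iteritems(). Note that u'x' and 'x' are the same str there, and the mixed-type data would not even sort, while hashing still works; the output order shown assumes Python 3.7+, where Counter preserves insertion order:

>>> from collections import Counter
>>> data = 3 * ('x ', (1,), 'x')
>>> [k for k, c in Counter(data).items() if c > 1]
['x ', (1,), 'x']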
answered Nov 16 '22 by jamylak