I frequently use sorted and groupby to find duplicate items in an iterable. Now I see that this is unreliable:
from itertools import groupby
data = 3 * ('x ', (1,), u'x')
duplicates = [k for k, g in groupby(sorted(data)) if len(list(g)) > 1]
print duplicates
# [] is printed - no duplicates found, as if all 9 values were unique
The reason why the code above fails in Python 2.x is explained here: comparisons between values of different types (str, tuple, unicode) are not transitive, so sorted() does not necessarily place equal items next to each other.
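To make this concrete: CPython 2.x falls back to ordering objects of different types by their type name, except that str and unicode are compared by value. For the values in data this makes "<" non-transitive (a minimal interactive demonstration; the exact behaviour is a CPython 2.x implementation detail):
>>> 'x ' < (1,)   # str < tuple: ordered by type name
True
>>> (1,) < u'x'   # tuple < unicode: ordered by type name
True
>>> u'x' < 'x '   # unicode vs str: compared by value
True
With these three results together, "<" is not transitive, so groupby(sorted(data)) can encounter the repeated values in non-adjacent positions and report no duplicates.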
What is a reliable pythonic way of finding duplicates?
I looked for similar questions/answers on SO. The best of them is "In Python, how do I take a list and reduce it to a list of duplicates?", but the accepted solution is not pythonic (it is a procedural multi-line for ... if ... add ... else ... add ... return result), and the other solutions are either unreliable (they depend on transitivity of the "<" operator, which is not guaranteed here) or slow (O(n*n)).
[EDIT] Closed. The accepted answer helped me to summarize more general conclusions in my answer below.
I like to use built-in types to represent e.g. tree structures, which is why I am now wary of mixing types.
Note: Assumes entries are hashable
>>> from collections import Counter
>>> data = 3 * ('x ', (1,), u'x')
>>> [k for k, c in Counter(data).iteritems() if c > 1]
[u'x', 'x ', (1,)]
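For completeness, the same Counter approach works on Python 3 after replacing iteritems() with items(); there the original sorted() version would raise a TypeError instead of failing silently, because mixed str/tuple comparisons are no longer allowed. A sketch (output order may differ on interpreters older than 3.7):
>>> from collections import Counter
>>> data = 3 * ('x ', (1,), 'x')
>>> [k for k, c in Counter(data).items() if c > 1]
['x ', (1,), 'x']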