Python: detect duplicates using a set

I have a large number of objects I need to store in memory for processing in Python. Specifically, I'm trying to remove duplicates from a large set of objects. I want to consider two objects "equal" if a certain instance variable in the object is equal. So, I assumed the easiest way to do this would be to insert all my objects into a set, and override the __hash__ method so that it hashes the instance variable I'm concerned with.

So, as a test I tried the following:

class Person:
    def __init__(self, n, a):
        self.name = n
        self.age = a

    def __hash__(self):
        return hash(self.name)

    def __str__(self):
        return "{0}:{1}".format(self.name, self.age)

myset = set()
myset.add(Person("foo", 10))
myset.add(Person("bar", 20))
myset.add(Person("baz", 30))
myset.add(Person("foo", 1000)) # try adding a duplicate

for p in myset: print(p)

Here, I define a Person class, and any two instances of Person with the same name variable are to be equal, regardless of the value of any other instance variable. Unfortunately, this outputs:

baz:30
foo:10
bar:20
foo:1000

Note that foo appears twice, so this program failed to notice duplicates. Yet the expression hash(Person("foo", 10)) == hash(Person("foo", 1000)) is True. So why doesn't this properly detect duplicate Person objects?

Channel72 asked May 12 '11 13:05


2 Answers

You forgot to also define __eq__().

If a class does not define a __cmp__() or __eq__() method it should not define a __hash__() operation either; if it defines __cmp__() or __eq__() but not __hash__(), its instances will not be usable in hashed collections. If a class defines mutable objects and implements a __cmp__() or __eq__() method, it should not implement __hash__(), since hashable collection implementations require that an object’s hash value is immutable (if the object’s hash value changes, it will be in the wrong hash bucket).
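The "wrong hash bucket" warning at the end of that passage can be seen in a small experiment. This is only a sketch: Tag is a hypothetical class, not something from the question, and its hash deliberately depends on an attribute that is changed after the object is stored in a set.

```python
class Tag:
    """Hypothetical class whose hash depends on a mutable attribute."""

    def __init__(self, label):
        self.label = label

    def __eq__(self, other):
        if not isinstance(other, Tag):
            return NotImplemented
        return self.label == other.label

    def __hash__(self):
        return hash(self.label)


t = Tag("foo")
tags = {t}

found_before = t in tags  # lookup uses hash("foo"), which matches the stored entry

t.label = "bar"           # the hash of t changes while it sits in the set

found_after = t in tags   # lookup now uses hash("bar") and misses the entry
still_stored = any(x is t for x in tags)  # the object is still physically in the set

print(found_before, found_after, still_stored)
```

The object is still in the set, but the set can no longer find it by lookup. That is why the attribute used for hashing and equality should be treated as read-only once objects are placed in a set or used as dict keys.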

Ignacio Vazquez-Abrams answered Sep 21 '22 13:09


A set has to deal with hash collisions, so equal hashes alone are not enough to treat two objects as duplicates. If the hashes of two objects match, the set compares them with the == operator to make sure they are really equal. In your case, that comparison yields True only if the two objects are the same object, which is the default behavior of == for user-defined classes.
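A quick way to check this is to compare a class that only defines __hash__ with one that also defines __eq__ and counts how often it is called. Both classes below (HashOnly and Traced) are hypothetical names made up for this illustration.

```python
class HashOnly:
    """Defines __hash__ but keeps the default identity-based ==."""

    def __init__(self, name):
        self.name = name

    def __hash__(self):
        return hash(self.name)


class Traced:
    """Defines both __hash__ and __eq__, and counts __eq__ calls."""

    eq_calls = 0

    def __init__(self, name):
        self.name = name

    def __hash__(self):
        return hash(self.name)

    def __eq__(self, other):
        Traced.eq_calls += 1
        return isinstance(other, Traced) and self.name == other.name


a, b = HashOnly("foo"), HashOnly("foo")
same_hash = hash(a) == hash(b)   # the hashes agree
equal = a == b                   # but default == compares identity
hash_only_size = len({a, b})     # so the set keeps both objects

traced_size = len({Traced("foo"), Traced("foo")})  # __eq__ settles the hash match

print(same_hash, equal, hash_only_size, traced_size, Traced.eq_calls)
```

With HashOnly the set ends up with two elements even though the hashes match. With Traced the set calls __eq__ when it sees the matching hash and keeps only one element.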

Long story short: Also define __eq__() to make it work.
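As a minimal sketch of that fix, here is the Person class from the question with an __eq__ that compares only the name. The isinstance check and the NotImplemented return are additions of mine, not part of the original code. Note that in Python 3 a class that defines __eq__ without __hash__ becomes unhashable, so both methods are needed.

```python
class Person:
    def __init__(self, n, a):
        self.name = n
        self.age = a

    def __eq__(self, other):
        # Two Person objects are equal when their names match;
        # age is deliberately ignored.
        if not isinstance(other, Person):
            return NotImplemented
        return self.name == other.name

    def __hash__(self):
        # Must be consistent with __eq__: equal objects, equal hashes.
        return hash(self.name)

    def __str__(self):
        return "{0}:{1}".format(self.name, self.age)


myset = set()
myset.add(Person("foo", 10))
myset.add(Person("bar", 20))
myset.add(Person("baz", 30))
myset.add(Person("foo", 1000))  # equal to the first "foo", so it is not added

for p in myset:
    print(p)
```

This prints three lines (in arbitrary order). The set keeps the first "foo" it saw, so the surviving entry is foo:10, not foo:1000.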

Sven Marnach answered Sep 20 '22 13:09