I have a large number of objects I need to store in memory for processing in Python. Specifically, I'm trying to remove duplicates from a large set of objects. I want to consider two objects "equal" if a certain instance variable in the object is equal. So, I assumed the easiest way to do this would be to insert all my objects into a set, and override the __hash__
method so that it hashes the instance variable I'm concerned with.
So, as a test I tried the following:
class Person:
def __init__(self, n, a):
self.name = n
self.age = a
def __hash__(self):
return hash(self.name)
def __str__(self):
return "{0}:{1}".format(self.name, self.age)
myset = set()
myset.add(Person("foo", 10))
myset.add(Person("bar", 20))
myset.add(Person("baz", 30))
myset.add(Person("foo", 1000)) # try adding a duplicate
for p in myset: print(p)
Here, I define a Person
class, and any two instances of Person
with the same name
variable are to be equal, regardless of the value of any other instance variable. Unfortunately, this outputs:
baz:30
foo:10
bar:20
foo:1000
Note that foo
appears twice, so this program failed to notice duplicates. Yet the expression hash(Person("foo", 10)) == hash(Person("foo", 1000))
is True
. So why doesn't this properly detect duplicate Person
objects?
If you want to count duplicates for a given element then use the count() function. Use a counter() function or basics logic combination to find all duplicated elements in a list and count them in Python.
If you want to identify duplicates across the entire data set, then select the entire set. Navigate to the Home tab and select the Conditional Formatting button. In the Conditional Formatting menu, select Highlight Cells Rules. In the menu that pops up, select Duplicate Values.
You forgot to also define __eq__()
.
If a class does not define a
__cmp__()
or__eq__()
method it should not define a__hash__()
operation either; if it defines__cmp__()
or__eq__()
but not__hash__()
, its instances will not be usable in hashed collections. If a class defines mutable objects and implements a__cmp__()
or__eq__()
method, it should not implement__hash__()
, since hashable collection implementations require that a object’s hash value is immutable (if the object’s hash value changes, it will be in the wrong hash bucket).
A set obviously will have to deal with hash collisions. If the hash of two objects matches, the set will compare them using the ==
operator to make sure they are really equal. In your case, this will only yield True
if the two objects are the same object (the standard implementation for user-defined classes).
Long story short: Also define __eq__()
to make it work.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With