I build large lists of high-level objects while parsing a tree. However, after this step, I have to remove duplicates from the list and I found this new step very slow in Python 2 (it was acceptable but still a little slow in Python 3). However I know that distinct objects actually have a distinct id. For this reason, I managed to get a much faster code by following these steps:
key=id
option;Thus I have a working code which now runs smoothly, but I wonder whether I can achieve this task more directly in Python.
Example. Let's build two identical objects having the same value but a different id (for instance I will take a fractions.Fraction
in order to rely on the standard library):
from fractions import Fraction
a = Fraction(1,3)
b = Fraction(1,3)
Now, if I try to achieve what I want to do by using the pythonical list(set(...))
, I get the wrong result as {a,b}
keeps only one of both values (which are identical but have a different id).
My question now is: what is the most pythonical, reliable, short and fast way for removing duplicates by id rather than duplicates by value? Order of the list doesn't matter if it has to be changed.
To remove the duplicates from a list, you can make use of the built-in function set(). The specialty of the set() method is that it returns distinct elements. You can remove duplicates from the given list by importing OrderedDictfrom collections.
You should override the __eq__
method so that it depends on the objects id
rather than its values. But note that your objects must be hashable as well, so you should define a proper __hash__
method too.
class My_obj:
def __init__(self, val):
self.val = val
def __hash__(self):
return hash(self.val)
def __eq__(self, arg):
return id(self) == id(arg)
def __repr__(self):
return str(self.val)
Demo:
a = My_obj(5)
b = My_obj(5)
print({a, b})
{5, 5}
Be, careful because discrimating by id
may fail with some basic types where python optimizes storage when possible:
a = "foo"
b = "foo"
print(a is b)
yields
True
Anyway, if you want to handle standard objects (even non-hashable ones) you can store them in a dictionary with they id
as key.
Example with fractions:
from fractions import Fraction
a = Fraction(1,3)
b = Fraction(1,3)
d = dict()
d[id(a)] = a
d[id(b)] = b
print(d.values())
result:
dict_values([Fraction(1, 3), Fraction(1, 3)])
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With