I build large lists of high-level objects while parsing a tree. However, after this step, I have to remove duplicates from the list and I found this new step very slow in Python 2 (it was acceptable but still a little slow in Python 3). However I know that distinct objects actually have a distinct id. For this reason, I managed to get a much faster code by following these steps: <ul> <li>append all objects to a list while parsing;</li> <li>sort the list with <code>key=id</code> option;</li> <li>iterate over the sorted list and remove an element if the previous one has the same id.</li> </ul> Thus I have a working code which now runs smoothly, but I wonder whether I can achieve this task more directly in Python. Example. Let's build two identical objects having the same value but a different id (for instance I will take a <code>fractions.Fraction</code> in order to rely on the standard library): <pre class="prettyprint"><code>from fractions import Fraction a = Fraction(1,3) b = Fraction(1,3) </code></pre> Now, if I try to achieve what I want to do by using the pythonical <code>list(set(...))</code>, I get the wrong result as <code>{a,b}</code> keeps only one of both values (which are identical but have a different id). My question now is: what is the most pythonical, reliable, short and fast way for removing duplicates by id rather than duplicates by value? Order of the list doesn't matter if it has to be changed.

You should override the <code>__eq__</code> method so that it depends on the objects <code>id</code> rather than its values. But note that your objects must be hashable as well, so you should define a proper <code>__hash__</code> method too. <pre class="prettyprint"><code>class My_obj: def __init__(self, val): self.val = val def __hash__(self): return hash(self.val) def __eq__(self, arg): return id(self) == id(arg) def __repr__(self): return str(self.val) </code></pre> Demo: <pre class="prettyprint"><code>a = My_obj(5) b = My_obj(5) print({a, b}) {5, 5} </code></pre>

Removing duplicates in a Python list by id

I build large lists of high-level objects while parsing a tree. However, after this step, I have to remove duplicates from the list and I found this new step very slow in Python 2 (it was acceptable but still a little slow in Python 3). However I know that distinct objects actually have a distinct id. For this reason, I managed to get a much faster code by following these steps:

append all objects to a list while parsing;
sort the list with key=id option;
iterate over the sorted list and remove an element if the previous one has the same id.

Thus I have a working code which now runs smoothly, but I wonder whether I can achieve this task more directly in Python.

Example. Let's build two identical objects having the same value but a different id (for instance I will take a fractions.Fraction in order to rely on the standard library):

from fractions import Fraction
a = Fraction(1,3)
b = Fraction(1,3)

Now, if I try to achieve what I want to do by using the pythonical list(set(...)), I get the wrong result as {a,b} keeps only one of both values (which are identical but have a different id).

My question now is: what is the most pythonical, reliable, short and fast way for removing duplicates by id rather than duplicates by value? Order of the list doesn't matter if it has to be changed.

How do you remove duplicates from a list in Python?

To remove the duplicates from a list, you can make use of the built-in function set(). The specialty of the set() method is that it returns distinct elements. You can remove duplicates from the given list by importing OrderedDictfrom collections.

You should override the __eq__ method so that it depends on the objects id rather than its values. But note that your objects must be hashable as well, so you should define a proper __hash__ method too.

class My_obj:
    def __init__(self, val):
        self.val = val

    def __hash__(self):
        return hash(self.val)

    def __eq__(self, arg):
        return id(self) == id(arg)

    def __repr__(self):
        return str(self.val)

Demo:

a = My_obj(5)
b = My_obj(5)

print({a, b})
{5, 5}

Be, careful because discrimating by id may fail with some basic types where python optimizes storage when possible:

a = "foo"
b = "foo"
print(a is b)

yields

True

Anyway, if you want to handle standard objects (even non-hashable ones) you can store them in a dictionary with they id as key.

Example with fractions:

from fractions import Fraction
a = Fraction(1,3)
b = Fraction(1,3)

d = dict()

d[id(a)] = a
d[id(b)] = b

print(d.values())

result:

dict_values([Fraction(1, 3), Fraction(1, 3)])

Removing duplicates in a Python list by id

Tags:

python

list

duplicates

Thomas Baruchel

People also ask

2 Answers

Mazdak

Jean-François Fabre

Recent Activity

Donate For Us

Removing duplicates in a Python list by id

Tags:

python

list

duplicates

Thomas Baruchel

People also ask

2 Answers

Mazdak

Jean-François Fabre

Related questions

Recent Activity

Donate For Us