Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Removing duplicates in a Python list by id

I build large lists of high-level objects while parsing a tree. However, after this step, I have to remove duplicates from the list and I found this new step very slow in Python 2 (it was acceptable but still a little slow in Python 3). However I know that distinct objects actually have a distinct id. For this reason, I managed to get a much faster code by following these steps:

  • append all objects to a list while parsing;
  • sort the list with key=id option;
  • iterate over the sorted list and remove an element if the previous one has the same id.

Thus I have a working code which now runs smoothly, but I wonder whether I can achieve this task more directly in Python.

Example. Let's build two identical objects having the same value but a different id (for instance I will take a fractions.Fraction in order to rely on the standard library):

from fractions import Fraction
a = Fraction(1,3)
b = Fraction(1,3)

Now, if I try to achieve what I want to do by using the pythonical list(set(...)), I get the wrong result as {a,b} keeps only one of both values (which are identical but have a different id).

My question now is: what is the most pythonical, reliable, short and fast way for removing duplicates by id rather than duplicates by value? Order of the list doesn't matter if it has to be changed.

like image 864
Thomas Baruchel Avatar asked Nov 19 '16 08:11

Thomas Baruchel


People also ask

How do you remove duplicates from a list in Python?

To remove the duplicates from a list, you can make use of the built-in function set(). The specialty of the set() method is that it returns distinct elements. You can remove duplicates from the given list by importing OrderedDictfrom collections.


2 Answers

You should override the __eq__ method so that it depends on the objects id rather than its values. But note that your objects must be hashable as well, so you should define a proper __hash__ method too.

class My_obj:
    def __init__(self, val):
        self.val = val

    def __hash__(self):
        return hash(self.val)

    def __eq__(self, arg):
        return id(self) == id(arg)

    def __repr__(self):
        return str(self.val)

Demo:

a = My_obj(5)
b = My_obj(5)

print({a, b})
{5, 5}
like image 148
Mazdak Avatar answered Oct 02 '22 22:10

Mazdak


Be, careful because discrimating by id may fail with some basic types where python optimizes storage when possible:

a = "foo"
b = "foo"
print(a is b)

yields

True

Anyway, if you want to handle standard objects (even non-hashable ones) you can store them in a dictionary with they id as key.

Example with fractions:

from fractions import Fraction
a = Fraction(1,3)
b = Fraction(1,3)

d = dict()

d[id(a)] = a
d[id(b)] = b

print(d.values())

result:

dict_values([Fraction(1, 3), Fraction(1, 3)])
like image 28
Jean-François Fabre Avatar answered Oct 02 '22 23:10

Jean-François Fabre