
Removing duplicate elements from a Python list containing unhashable elements while preserving order?

Tags:

python

I have a data structure like this:

[
[('A', '1'), ('B', '2')],
[('A', '1'), ('B', '2')],
[('A', '4'), ('C', '5')]
]

And I want to obtain this:

[
[('A', '1'), ('B', '2')],
[('A', '4'), ('C', '5')]
]

Is there a good way of doing this while preserving order as shown?

Commands for copy-pasting:

sample = []
sample.append([('A', '1'), ('B', '2')])
sample.append([('A', '1'), ('B', '2')])
sample.append([('A', '4'), ('C', '5')])
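Note that the usual `set`- or `dict`-based dedup fails here because the rows are lists, which are unhashable. One common workaround (a sketch, assuming the inner elements are themselves hashable tuples, as in the sample) converts each row to a tuple for use as a dict key:

```python
sample = []
sample.append([('A', '1'), ('B', '2')])
sample.append([('A', '1'), ('B', '2')])
sample.append([('A', '4'), ('C', '5')])

# set(sample) would raise TypeError: unhashable type: 'list'.
# Tuples are hashable, and dict.fromkeys preserves insertion order
# (guaranteed since Python 3.7), so this keeps the first occurrence of each row.
deduped = [list(row) for row in dict.fromkeys(map(tuple, sample))]
print(deduped)  # [[('A', '1'), ('B', '2')], [('A', '4'), ('C', '5')]]
```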
asked Nov 01 '11 by Legend

2 Answers

This is a somewhat famous question which was well answered by a famous Pythonista long ago: http://code.activestate.com/recipes/52560-remove-duplicates-from-a-sequence/

If you can assume equal records are adjacent, there is a recipe in the itertools docs:

from operator import itemgetter
from itertools import groupby

def unique_justseen(iterable, key=None):
    "List unique elements, preserving order. Remember only the element just seen."
    # unique_justseen('AAAABBBCCDAABBB') --> A B C D A B
    # unique_justseen('ABBCcAD', str.lower) --> A B C A D
    return map(next, map(itemgetter(1), groupby(iterable, key)))

(The recipe originally used itertools.imap, which is Python 2 only; on Python 3, the built-in map is already lazy.)
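In the sample data the duplicate rows happen to be adjacent, so this recipe applies directly. A self-contained Python 3 run (restating the recipe with `map` in place of the Python 2 `itertools.imap`):

```python
from operator import itemgetter
from itertools import groupby

def unique_justseen(iterable, key=None):
    "List unique elements, preserving order. Remember only the element just seen."
    return map(next, map(itemgetter(1), groupby(iterable, key)))

sample = [[('A', '1'), ('B', '2')],
          [('A', '1'), ('B', '2')],
          [('A', '4'), ('C', '5')]]

# groupby collapses runs of equal adjacent elements; next() takes one per run.
print(list(unique_justseen(sample)))
# [[('A', '1'), ('B', '2')], [('A', '4'), ('C', '5')]]
```

Bear in mind this only removes *adjacent* duplicates; non-adjacent repeats would survive.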

If you can only assume orderable elements, here is a variant using the bisect module. Given n inputs with r unique values, the searches cost O(n log r) in total. Each newly found unique value is inserted into the sorted seen list at O(r) cost per insertion, or O(r * r) in total.

from bisect import bisect_left, insort

def dedup(seq):
    'Remove duplicates. Preserve the order elements are first seen. Assumes orderable, but not necessarily hashable, elements.'
    result = []
    seen = []
    for x in seq:
        i = bisect_left(seen, x)
        if i == len(seen) or seen[i] != x:
            seen.insert(i, x)
            result.append(x)
    return result
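Unlike the groupby recipe, this variant does not require equal records to be adjacent. A self-contained run (repeating the function) on data where the duplicate is *not* adjacent:

```python
from bisect import bisect_left

def dedup(seq):
    'Remove duplicates. Preserve order first seen. Assume orderable elements.'
    result = []
    seen = []   # kept sorted so bisect_left can binary-search it
    for x in seq:
        i = bisect_left(seen, x)
        if i == len(seen) or seen[i] != x:
            seen.insert(i, x)
            result.append(x)
    return result

data = [[('A', '1'), ('B', '2')],
        [('A', '4'), ('C', '5')],
        [('A', '1'), ('B', '2')]]   # duplicate row is NOT adjacent

print(dedup(data))
# [[('A', '1'), ('B', '2')], [('A', '4'), ('C', '5')]]
```

Lists of tuples compare element-wise, which is all `bisect_left` needs; no hashing is involved.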
answered Sep 28 '22 by Raymond Hettinger

Here is an order-preserving variant of the sort/unique idiom. This will give you O(n log n) performance, provided your items are at least sortable.

import itertools

def unique(a):
    indices = sorted(range(len(a)), key=a.__getitem__)
    indices = set(next(it) for k, it in
                  itertools.groupby(indices, key=a.__getitem__))
    return [x for i, x in enumerate(a) if i in indices]

Example (with hashable items for simplicity):

>>> a = ['F', 'J', 'B', 'F', 'V', 'A', 'E', 'U', 'B', 'U', 'Z', 'K']
>>> unique(a)
['F', 'J', 'B', 'V', 'A', 'E', 'U', 'Z', 'K']
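It also handles the original unhashable sample, since only the integer indices (never the elements themselves) go into a set. A self-contained run (repeating the function):

```python
import itertools

def unique(a):
    # Sort indices by element value, so equal elements become adjacent.
    indices = sorted(range(len(a)), key=a.__getitem__)
    # For each group of equal elements, keep the smallest (first-seen) index.
    indices = set(next(it) for k, it in
                  itertools.groupby(indices, key=a.__getitem__))
    return [x for i, x in enumerate(a) if i in indices]

sample = [[('A', '1'), ('B', '2')],
          [('A', '1'), ('B', '2')],
          [('A', '4'), ('C', '5')]]

print(unique(sample))
# [[('A', '1'), ('B', '2')], [('A', '4'), ('C', '5')]]
```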
answered Sep 28 '22 by Sven Marnach