Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I remove identical items from a list and sort it in Python?

How can I most optimally remove identical items from a list and sort it in Python?

Say I have a list:

my_list = ['a', 'a', 'b', 'c', 'd', 'a', 'e', 'd', 'f', 'e']

I could iterate over a copy of the list (since you should not mutate the list while iterating over it), item for item, and remove all of the item except for one:

for item in my_list[:]: # must iterate over a copy because mutating it
    count = my_list.count(item) # see how many are in the list
    if count > 1:
        for _ in range(count-1): # remove all but one of the element
            my_list.remove(item)

Which removes the redundant items:

['b', 'c', 'a', 'd', 'f', 'e']

and then sort the list:

my_list.sort()

so my_list is now:

['a', 'b', 'c', 'd', 'e', 'f']

But what's the most efficient and direct (i.e. performant) way to remove the identical elements and sort this list?

*This question came up at work (I wanted so badly to answer it, but one of our senior-most Python developers got to it before me), and I also brought it up at my local Python Meetup group, and few people had a good answer for it, so I'm answering it Q&A style, as suggested by Stackoverflow.

like image 400
Russia Must Remove Putin Avatar asked Mar 30 '14 06:03

Russia Must Remove Putin


1 Answers

The best way to remove redundant elements from a list is to cast it as a set, and since sorted accepts any iterable and returns a list, this is far more efficient than doing this piecewise.

my_list = ['a', 'a', 'b', 'c', 'd', 'a', 'e', 'd', 'f', 'e']

def sorted_set(a_list):
    return sorted(set(a_list))

new_list = sorted_set(my_list)

and new_list is:

['a', 'b', 'c', 'd', 'e', 'f']

The downside of this approach is that elements given to set must be hashable, so if the elements are unhashable, you'll get an error:

>>> my_list = [['a'], ['a'], ['b'], ['c']]
>>> sorted(set(my_list))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'

This trivial case could be addressed by casting the sublists as tuples, which may be more performant than the solution in the answer, which could mean more expensive tests for equality:

>>> my_list = [tuple(i) for i in my_list]
>>> sorted(set(my_list))
[('a',), ('b',), ('c',)]

But other cases would need to find different workarounds. This would not be necessary with the other solution (but again, may be much more computationally expensive):

def remove_extras_and_sort(my_list):
    for item in my_list[:]:
        count = my_list.count(item)
        if count > 1:
            for _ in range(count-1):
                my_list.remove(item)
    my_list.sort()
    return my_list

Which works for sublists:

>>> my_list = [['a'], ['a'], ['b'], ['c']]
>>> remove_extras_and_sort(my_list)
[['a'], ['b'], ['c']]

To compare performance:

import timeit

setup = '''
my_list = ['a', 'a', 'b', 'c', 'd', 'a', 'e', 'd', 'f', 'e']
def remove_extras_and_sort(my_list):
    for item in my_list[:]:
        count = my_list.count(item)
        if count > 1:
            for _ in range(count-1):
                my_list.remove(item)
    my_list.sort()
    return my_list

def sorted_set(a_list):
    return sorted(set(a_list))
'''

timeit.timeit('sorted_set(my_list[:])', setup=setup)
timeit.timeit('remove_extras_and_sort(my_list[:])', setup=setup)

Which returns times as I measure them on my system, respectively:

1.5562372207641602
4.558010101318359

Which means that the method given in the question likely takes more than 3 times as long to compute, given the necessary overhead for copying the lists each time (if we don't copy the lists, we'll just be sorting a list that's already been sorted, since the setup only runs once).


We can disassemble each function:

import dis

def remove_extras_and_sort(my_list):
    for item in my_list[:]:
        count = my_list.count(item)
        if count > 1:
            for _ in range(count-1):
                my_list.remove(item)
    my_list.sort()
    return my_list

def sorted_set(a_list):
    return sorted(set(a_list))

And just by looking at the output, we see that the bytecode for the first function is more than six times as long:

>>> dis.dis(remove_extras_and_sort)
  2           0 SETUP_LOOP              85 (to 88)
              3 LOAD_FAST                0 (my_list)
              6 SLICE+0             
              7 GET_ITER            
        >>    8 FOR_ITER                76 (to 87)
             11 STORE_FAST               1 (item)

  3          14 LOAD_FAST                0 (my_list)
             17 LOAD_ATTR                0 (count)
             20 LOAD_FAST                1 (item)
             23 CALL_FUNCTION            1
             26 STORE_FAST               2 (count)

  4          29 LOAD_FAST                2 (count)
             32 LOAD_CONST               1 (1)
             35 COMPARE_OP               4 (>)
             38 POP_JUMP_IF_FALSE        8

  5          41 SETUP_LOOP              40 (to 84)
             44 LOAD_GLOBAL              1 (range)
             47 LOAD_FAST                2 (count)
             50 LOAD_CONST               1 (1)
             53 BINARY_SUBTRACT     
             54 CALL_FUNCTION            1
             57 GET_ITER            
        >>   58 FOR_ITER                19 (to 80)
             61 STORE_FAST               3 (_)

  6          64 LOAD_FAST                0 (my_list)
             67 LOAD_ATTR                2 (remove)
             70 LOAD_FAST                1 (item)
             73 CALL_FUNCTION            1
             76 POP_TOP             
             77 JUMP_ABSOLUTE           58
        >>   80 POP_BLOCK           
             81 JUMP_ABSOLUTE            8
        >>   84 JUMP_ABSOLUTE            8
        >>   87 POP_BLOCK           

  7     >>   88 LOAD_FAST                0 (my_list)
             91 LOAD_ATTR                3 (sort)
             94 CALL_FUNCTION            0
             97 POP_TOP             

  8          98 LOAD_FAST                0 (my_list)
            101 RETURN_VALUE        

And the recommended way has much shorter bytecode:

>>> dis.dis(sorted_set)
  2           0 LOAD_GLOBAL              0 (sorted)
              3 LOAD_GLOBAL              1 (set)
              6 LOAD_FAST                0 (a_list)
              9 CALL_FUNCTION            1
             12 CALL_FUNCTION            1
             15 RETURN_VALUE        

So we see that using the builtin functionality of Python is much more effective and efficient than trying to reinvent the wheel.


Addendum: other options that need to be more fully explored:

def groupby_sorted(my_list):
    """if items in my_list are unhashable"""
    from itertools import groupby
    return [e for e, g in groupby(sorted(my_list))]

def preserve_order_encountered(my_list):
    """elements in argument must be hashable - preserves order encountered"""
    from collections import OrderedDict
    return list(OrderedDict.fromkeys(my_list))
like image 82
Russia Must Remove Putin Avatar answered Oct 06 '22 12:10

Russia Must Remove Putin