What's the most efficient way to remove duplicate items from a list and sort it in Python?
Say I have a list:
my_list = ['a', 'a', 'b', 'c', 'd', 'a', 'e', 'd', 'f', 'e']
I could iterate over a copy of the list (since you should not mutate a list while iterating over it), item by item, and remove all but one of each element:
for item in my_list[:]:  # must iterate over a copy because we mutate the original
    count = my_list.count(item)  # see how many are in the list
    if count > 1:
        for _ in range(count - 1):  # remove all but one of the element
            my_list.remove(item)
Which removes the redundant items:
['b', 'c', 'a', 'd', 'f', 'e']
and then sort the list:
my_list.sort()
so my_list is now:
['a', 'b', 'c', 'd', 'e', 'f']
But what's the most efficient and direct (i.e. performant) way to remove the identical elements and sort this list?
*This question came up at work (I wanted badly to answer it, but one of our most senior Python developers got to it first). I also brought it up at my local Python meetup group, and few people had a good answer for it, so I'm answering it Q&A style, as Stack Overflow suggests.*
The best way to remove duplicate elements from a list is to convert it to a set. Since sorted accepts any iterable and returns a list, this is far more efficient than doing it piecewise:
my_list = ['a', 'a', 'b', 'c', 'd', 'a', 'e', 'd', 'f', 'e']
def sorted_set(a_list):
    return sorted(set(a_list))
new_list = sorted_set(my_list)
and new_list is:
['a', 'b', 'c', 'd', 'e', 'f']
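Because sorted accepts any iterable, the same helper works unchanged on tuples, generators, and other iterables of hashable items (a quick sketch):

```python
def sorted_set(a_list):
    # same helper as above: set() drops duplicates, sorted() returns a list
    return sorted(set(a_list))

# The input need not be a list - any iterable of hashable items works:
print(sorted_set(('a', 'a', 'b')))          # from a tuple
print(sorted_set(c for c in 'banana'))      # from a generator expression
```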
The downside of this approach is that elements given to set must be hashable, so if the elements are unhashable, you'll get an error:
>>> my_list = [['a'], ['a'], ['b'], ['c']]
>>> sorted(set(my_list))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
This trivial case can be addressed by converting the sublists to tuples first, which may still outperform the piecewise solution below, with its repeated and potentially expensive tests for equality:
>>> my_list = [tuple(i) for i in my_list]
>>> sorted(set(my_list))
[('a',), ('b',), ('c',)]
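For deeper nesting, a one-level tuple cast isn't enough. One possible workaround (a sketch, not part of the original answer; the helper name `freeze` is mine) is to convert nested lists to tuples recursively before deduplicating:

```python
def freeze(item):
    # Recursively convert lists to tuples so the result is hashable
    if isinstance(item, list):
        return tuple(freeze(sub) for sub in item)
    return item

nested = [['a', ['b']], ['a', ['b']], ['c']]
print(sorted(set(freeze(i) for i in nested)))  # [('a', ('b',)), ('c',)]
```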
But other cases would need different workarounds. None of that is necessary with the piecewise solution (though, again, it may be much more computationally expensive):
def remove_extras_and_sort(my_list):
    for item in my_list[:]:
        count = my_list.count(item)
        if count > 1:
            for _ in range(count - 1):
                my_list.remove(item)
    my_list.sort()
    return my_list
Which works for sublists:
>>> my_list = [['a'], ['a'], ['b'], ['c']]
>>> remove_extras_and_sort(my_list)
[['a'], ['b'], ['c']]
import timeit

setup = '''
my_list = ['a', 'a', 'b', 'c', 'd', 'a', 'e', 'd', 'f', 'e']

def remove_extras_and_sort(my_list):
    for item in my_list[:]:
        count = my_list.count(item)
        if count > 1:
            for _ in range(count - 1):
                my_list.remove(item)
    my_list.sort()
    return my_list

def sorted_set(a_list):
    return sorted(set(a_list))
'''

timeit.timeit('sorted_set(my_list[:])', setup=setup)
timeit.timeit('remove_extras_and_sort(my_list[:])', setup=setup)
which returns the following times on my system, respectively:
1.5562372207641602
4.558010101318359
This means the method given in the question takes roughly three times as long to compute, including the necessary overhead of copying the list on each call. (If we didn't copy, we'd just be re-sorting an already deduplicated list, since the setup only runs once.)
We can also compare the bytecode of the two functions with dis:

import dis

def remove_extras_and_sort(my_list):
    for item in my_list[:]:
        count = my_list.count(item)
        if count > 1:
            for _ in range(count - 1):
                my_list.remove(item)
    my_list.sort()
    return my_list

def sorted_set(a_list):
    return sorted(set(a_list))
And just by looking at the output, we see that the bytecode for the first function is more than six times as long:
>>> dis.dis(remove_extras_and_sort)
2 0 SETUP_LOOP 85 (to 88)
3 LOAD_FAST 0 (my_list)
6 SLICE+0
7 GET_ITER
>> 8 FOR_ITER 76 (to 87)
11 STORE_FAST 1 (item)
3 14 LOAD_FAST 0 (my_list)
17 LOAD_ATTR 0 (count)
20 LOAD_FAST 1 (item)
23 CALL_FUNCTION 1
26 STORE_FAST 2 (count)
4 29 LOAD_FAST 2 (count)
32 LOAD_CONST 1 (1)
35 COMPARE_OP 4 (>)
38 POP_JUMP_IF_FALSE 8
5 41 SETUP_LOOP 40 (to 84)
44 LOAD_GLOBAL 1 (range)
47 LOAD_FAST 2 (count)
50 LOAD_CONST 1 (1)
53 BINARY_SUBTRACT
54 CALL_FUNCTION 1
57 GET_ITER
>> 58 FOR_ITER 19 (to 80)
61 STORE_FAST 3 (_)
6 64 LOAD_FAST 0 (my_list)
67 LOAD_ATTR 2 (remove)
70 LOAD_FAST 1 (item)
73 CALL_FUNCTION 1
76 POP_TOP
77 JUMP_ABSOLUTE 58
>> 80 POP_BLOCK
81 JUMP_ABSOLUTE 8
>> 84 JUMP_ABSOLUTE 8
>> 87 POP_BLOCK
7 >> 88 LOAD_FAST 0 (my_list)
91 LOAD_ATTR 3 (sort)
94 CALL_FUNCTION 0
97 POP_TOP
8 98 LOAD_FAST 0 (my_list)
101 RETURN_VALUE
And the recommended way has much shorter bytecode:
>>> dis.dis(sorted_set)
2 0 LOAD_GLOBAL 0 (sorted)
3 LOAD_GLOBAL 1 (set)
6 LOAD_FAST 0 (a_list)
9 CALL_FUNCTION 1
12 CALL_FUNCTION 1
15 RETURN_VALUE
So we see that using the builtin functionality of Python is much more effective and efficient than trying to reinvent the wheel.
Addendum: other options that need to be more fully explored:
def groupby_sorted(my_list):
    """For when items in my_list are unhashable."""
    from itertools import groupby
    return [e for e, g in groupby(sorted(my_list))]

def preserve_order_encountered(my_list):
    """Elements must be hashable - preserves the order encountered."""
    from collections import OrderedDict
    return list(OrderedDict.fromkeys(my_list))
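For instance, `groupby_sorted` handles the unhashable-sublist case from earlier, since it relies on sorting rather than hashing (a quick check):

```python
from itertools import groupby

def groupby_sorted(my_list):
    """For when items in my_list are unhashable."""
    # Sorting brings equal elements together; groupby then collapses each run
    return [e for e, g in groupby(sorted(my_list))]

print(groupby_sorted([['a'], ['a'], ['b'], ['c']]))  # [['a'], ['b'], ['c']]
```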
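A quick check of the order-preserving variant; note that it deduplicates without sorting, keeping the first occurrence of each element. (On Python 3.7+, a plain `dict.fromkeys` behaves the same way, since dicts preserve insertion order.)

```python
from collections import OrderedDict

def preserve_order_encountered(my_list):
    """Elements must be hashable - preserves the order encountered."""
    # fromkeys keeps the first occurrence of each element, in order
    return list(OrderedDict.fromkeys(my_list))

print(preserve_order_encountered(['b', 'a', 'b', 'c', 'a']))  # ['b', 'a', 'c']
```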