Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Alternative to python's .sort() (for inserting into a large list and keeping it sorted)

I need to continuously add numbers to a pre-sorted list:

for num in numberList:
    list.append(num)
    list.sort()

Each iteration is short but when the given numberList contains tens of thousands of values this method slows way down. Is there a more efficient function available that leaves a list intact and seeks out which index to insert a new number to preserve the correct order of numbers? Anything I've tried writing myself takes longer than .sort()

like image 904
user3555455 Avatar asked Jul 18 '15 17:07

user3555455


2 Answers

You can use the bisect.insort() function to insert values into an already sorted list:

from bisect import insort

insort(list, num)

Note that this'll still take some time as the remaining elements after the insertion point all have to be shifted up a step; you may want to think about re-implementing the list as a linked list instead.

However, if you are keeping the list sorted just to always be able to get the smallest or largest number, you should use the heapq module instead; a heap is not kept in strict sorted order, but is very efficient at giving you the either the smallest or largest value very quickly, at all times.

like image 51
Martijn Pieters Avatar answered Sep 30 '22 22:09

Martijn Pieters


See the native bisect.insort() which implements insertion sort on lists, this should perfectly fit your needs since the complexity is O(n) at best and O(n^2) at worst instead of O(nlogn) with your current solution (resorting after insertion).

However, there are faster alternatives to construct a sorted data structure, such as Skip Lists and Binary Search Trees which will allow insertion with complexity O(log n) at best and O(n) at worst, or even better B-trees, Red-Black trees, Splay trees and AVL trees which all have a complexity O(log n) at both best and worst cases. More infos about the complexity of all those solutions and others can be found in the great BigO CheatSheet by Eric Rowell. Note however that all those solutions require you to install a third-party module, and generally they need to be compiled with a C compiler.

However, there is a pure-python module called sortedcontainers, which claims to be as fast or faster than C compiled Python extensions of implementations of AVL trees and B-trees (benchmark available here).

I benchmarked a few solutions to see which is the fastest to do an insertion sort:

sortedcontainers: 0.0860911591881
bisect: 0.665865982912
skiplist: 1.49330501066
sort_insert: 17.4167637739

Here's the code I used to benchmark:

from timeit import Timer
setup = """
L = list(range(10000)) + list(range(10100, 30000))
from bisect import insort

def sort_insert(L, x):
    L.append(x)
    L.sort()

from lib.skiplist import SkipList
L2 = SkipList(allowDups=1)
for x in L:
    L2.insert(x)

from lib.sortedcontainers import SortedList
L3 = SortedList(L)
"""

# Using sortedcontainers.SortedList()
t_sortedcontainers = Timer("for i in xrange(10000, 10100): L3.add(i)", setup)
# Using bisect.insort()
t_bisect = Timer("for i in xrange(10000, 10100): insort(L, i)", setup)
# Using a Skip List
t_skiplist = Timer("for i in xrange(10000, 10100): L2.insert(i)", setup)
# Using a standard list insert and then sorting
t_sort_insert = Timer("for i in xrange(10000, 10100): sort_insert(L, i)", setup)

# Timing the results
print t_sortedcontainers.timeit(number=100)
print t_bisect.timeit(number=100)
print t_skiplist.timeit(number=100)
print t_sort_insert.timeit(number=100)

So the results indicate that the sortedcontainers is indeed almost 7x faster than bisect (and I expect the speed gap to increase with the list size since the complexity is an order of magnitude different).

What's more surprising is that the skip list is slower than bisect, but it's probably because it's not as optimized as bisect, which is implemented in C and may use some optimization tricks (note that the skiplist.py module I used was the fastest pure-Python Skip List I could find, the pyskip module being a lot slower).

Also worth of note: if you need to use more complex structures than lists, the sortedcontainers module offers SortedList, SortedListWithKey, SortedDict and SortedSet (while bisect only works on lists). Also, you might be interested by this somewhat related benchmark and this complexity cheatsheet of various Python operations.

like image 21
gaborous Avatar answered Sep 30 '22 23:09

gaborous