Huge difference in timing between sorting a set vs sorting a list in Python

Tags:

I was wondering whether I should have my data structure as a set or a list. Mostly I will do set operations, but in the end I will need to sort it.

I wondered whether I should first make the set a list, and then use sorted(list(my_set)), or just sort the set immediately sorted(my_set). Arguably, I might consider a general "to list" phase, since having an ordered iterable at that point in time might make sense anyway.

So I decided to test it, expecting a list to be quicker.

Benchmarker:

import time
def sorter(x):
    t1 = time.time()
    for i in range(1000000):
        sorted(x)
    return time.time() - t1

Data:

one = range(1000)
a1 = list(one)
b1 = set(one)
sorter(a1)
# time: 16.5 s 
sorter(b1)
# time: 20.7 s

I then realized it might have to do with the fact that the elements are already in place, and remembered this amazing question & answer.

Then, I tried some random data:

two = numpy.random.randint(1, 1000, 1000)
a2 = list(two)
b2 = set(two)

With the results:

sorter(a2)
# time: 4min 49s
sorter(b2)
# time: 18.9 s

Huge difference, what is going on?

Bonus: It even appears at a timing of one minute, that sorted(set(a_list)) is impressively faster than sorted(a_list).

Indeed in the second case, there can be duplicates which would be filtered, and thus speed up sorting.

235

asked Dec 23 '14 00:12

PascalVKooten

1 Answers

I extended your code a bit and hope that this will give you insight into what is happening:

import numpy
import uuid
import random
import time

def sorter(x):
    t1 = time.time()
    for i in range(10000):
        sorted(x)
    return time.time() - t1

def pr(name, x):
    print('sorter {:<12s} {:<11} (length {:>4})'.format(
        name, '{:.8}'.format(sorter(x)), len(x)))

a2sizes = []
b2sizes = []

for x in range(1000):
    two = numpy.random.randint(1, 1000, 1000)
    a2 = list(two)
    b2 = set(two)
    a2sizes.append(len(a2))
    b2sizes.append(len(b2))

print 'average number of elements in a2', sum(a2sizes)/len(a2sizes)
n = sum(b2sizes)/len(b2sizes)
print 'average number of elements in b2', n

this prints out:

average number of elements in a2 1000
average number of elements in b2 632

This is because of the collision in the random number range

print
pr('a2', a2)
# making a list of set gives you already sorted elements
y = list(b2)
pr('y', y)
random.shuffle(y)
pr('shuffled y ', y)
pr('b2', b2)

gives as output:

sorter a2           2.492537    (length 1000)
sorter b2           0.25028086  (length  633)
sorter y            0.19689608  (length  633)
sorter shuffled y   1.4935901   (length  633)

That b2 would be faster because there are less elements is logical, but that this is even faster if you first make a list of the set must have some reason. That it again is slower if you shuffle that list is again logical and the shuffled result is rather close to the result for a2 when compensated for the length of the lists.

So lets try to put something else in the list:

b3 = set()
for x in range(1000):
    b3.add(uuid.uuid4())

print '\nuuid elements', len(b3)

a3 = list(b3)
pr('a3', a3)
random.shuffle(a3)
pr('shuffled a3', a3)
pr('b3', b3)

gives (I would have been rather surprised to have less than 1000 elements):

uuid elements 1000
sorter a3           32.437758   (length 1000)
sorter shuffled a3  32.178433   (length 1000)
sorter b3           32.163802   (length 1000)

So it must have something to do with having numbers in the set:

previous = -1
ordered = True
for popped in b2:
    if popped < previous:
        print 'popped', popped, previous
        ordered = False
    previous = popped

print '\nOrdered', ordered

gives you:

Ordered True

Instead of iterating , a set has a pop() function you can try and use:

pop()

Remove and return an arbitrary element from the set. Raises KeyError if the set is empty.

So lets arbitrarily retrieve elements from the set b2 and see if there is something special:

previous = -1
ordered = True
while(b2):
    popped = b2.pop()
    if popped < previous:
        print 'popped', popped, previous
        ordered = False
    previous = popped

print '\nOrdered', ordered

gives the same result:

Ordered True

So arbitrarily retrieving elements of set of number retrieves those numbers in order, independent off the ordering how these numbers were put in. As the iteration is how the list-making retrieves an element at a time to append to the list, the result of list(b2) is an ordered list and that gets sorted quite fast using the Timsort algorithm used in Python.

102

answered Oct 05 '22 09:10

Anthon

Related questions
                            
                                Least cpu intensive method of checking if a list has changed in c#
                            
                                Counting duplicate numbers in a list
                            
                                Check if a list of integers increments by one
                            
                                How to find length of an element in a list?
                            
                                Using Haskell's map function to calculate the sum of a list
                            
                                Most Pythonic Way to Build Dictionary From Single List
                            
                                Python: Transform a Dictionary into a list of lists
                            
                                Check if list is flat in Haskell
                            
                                How to copy List to Array
                            
                                Python Running cumulative sum with a given window
                            
                                Filter a list to only leave objects that occur once
                            
                                How can I remove multiple characters in a list?
                            
                                What's the most pythonic way to ensure that all elements of a list are different?
                            
                                What is the easiest way to foreach through a List<T> removing unwanted objects?
                            
                                Create List With Numbers Getting Greater Each Time Python [closed]
                            
                                jackson xml lists deserialization recognized as duplicate keys
                            
                                How do I delete/count objects in a s3 bucket?
                            
                                Android: Getting list of device applications - Very Slow
                            
                                What is the time complexity of name look-up in an R list?
                            
                                List and Linq To Sql Performance Issue

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Huge difference in timing between sorting a set vs sorting a list in Python

Tags:

list

sorting

set

python-2.7

PascalVKooten

People also ask

1 Answers

Anthon

Recent Activity

Donate For Us