I've been immensely frustrated with many of the implementations of python radix sort out there on the web. They consistently use a radix of 10 and get the digits of the numbers they iterate over by dividing by a power of 10 or taking the log10 of the number. This is incredibly inefficient, as log10 is not a particularly quick operation compared to bit shifting, which is nearly 100 times faster! A much more efficient implementation uses a radix of 256 and sorts the number byte by byte. This allows for all of the 'byte getting' to be done using the ridiculously quick bit operators. Unfortunately, it seems that absolutely nobody out there has implemented a radix sort in python that uses bit operators instead of logarithms. So, I took matters into my own hands and came up with this beast, which runs at about half the speed of sorted on small arrays and runs nearly as quickly on larger ones (e.g. <code>len</code> around 10,000,000): <pre class="prettyprint"><code>import itertools def radix_sort(unsorted): "Fast implementation of radix sort for any size num." maximum, minimum = max(unsorted), min(unsorted) max_bits = maximum.bit_length() highest_byte = max_bits // 8 if max_bits % 8 == 0 else (max_bits // 8) + 1 min_bits = minimum.bit_length() lowest_byte = min_bits // 8 if min_bits % 8 == 0 else (min_bits // 8) + 1 sorted_list = unsorted for offset in xrange(lowest_byte, highest_byte): sorted_list = radix_sort_offset(sorted_list, offset) return sorted_list def radix_sort_offset(unsorted, offset): "Helper function for radix sort, sorts each offset." byte_check = (0xFF << offset*8) buckets = [[] for _ in xrange(256)] for num in unsorted: byte_at_offset = (num & byte_check) >> offset*8 buckets[byte_at_offset].append(num) return list(itertools.chain.from_iterable(buckets)) </code></pre> This version of radix sort works by finding which bytes it has to sort by (if you pass it only integers below 256, it'll sort just one byte, etc.) then sorting each byte from LSB up by dumping them into buckets in order then just chaining the buckets together. Repeat this for each byte that needs to be sorted and you have your nice sorted array in O(n) time. However, it's not as fast as it could be, and I'd like to make it faster before I write about it as a better radix sort than all the other radix sorts out there. Running <code>cProfile</code> on this tells me that a lot of time is being spent on the <code>append</code> method for lists, which makes me think that this block: <pre class="prettyprint"><code> for num in unsorted: byte_at_offset = (num & byte_check) >> offset*8 buckets[byte_at_offset].append(num) </code></pre> in <code>radix_sort_offset</code> is eating a lot of time. This is also the block that, if you really look at it, does 90% of the work for the whole sort. This code looks like it could be <code>numpy</code>-ized, which I think would result in quite a performance boost. Unfortunately, I'm not very good with <code>numpy</code>'s more complex features so haven't been able to figure that out. Help would be very appreciated. I'm currently using <code>itertools.chain.from_iterable</code> to flatten the <code>buckets</code>, but if anyone has a faster suggestion I'm sure it would help as well. Originally, I had a <code>get_byte</code> function that returned the <code>n</code>th byte of a number, but inlining the code gave me a huge speed boost so I did it. Any other comments on the implementation or ways to squeeze out more performance are also appreciated. I want to hear anything and everything you've got.

You already realized that <pre class="prettyprint"><code>for num in unsorted: byte_at_offset = (num & byte_check) >> offset*8 buckets[byte_at_offset].append(num) </code></pre> is where most of the time goes - good ;-) There are two standard tricks for speeding that kind of thing, both having to do with moving invariants out of loops: <ol> <li>Compute "offset*8" outside the loop. Store it in a local variable. Save a multiplication per iteration.</li> <li>Add <code>bucketappender = [bucket.append for bucket in buckets]</code> outside the loop. Saves a method lookup per iteration.</li> </ol> Combine them, and the loop looks like: <pre class="prettyprint"><code>for num in unsorted: bucketappender[(num & byte_check) >> ofs8](num) </code></pre> Collapsing it to one statement also saves a pair of local vrbl store/fetch opcodes per iteration. But, at a higher level, the standard way to speed radix sort is to use a larger radix. What's magical about 256? Nothing, apart from that it's convenient for bit-shifting. But so are 512, 1024, 2048 ... it's a classical time/space tradeoff. PS: for very long numbers, <pre class="prettyprint"><code>(num >> offset*8) & 0xff </code></pre> will run faster. That's because your <code>num & byte_check</code> takes time proportional to <code>log(num)</code> - it generally has to create an integer about as big as <code>num</code>.

Pushing Radix Sort (and python) to its limits

Tags:

python

optimization

sorting

numpy

radix-sort

I've been immensely frustrated with many of the implementations of python radix sort out there on the web.

They consistently use a radix of 10 and get the digits of the numbers they iterate over by dividing by a power of 10 or taking the log10 of the number. This is incredibly inefficient, as log10 is not a particularly quick operation compared to bit shifting, which is nearly 100 times faster!

A much more efficient implementation uses a radix of 256 and sorts the number byte by byte. This allows for all of the 'byte getting' to be done using the ridiculously quick bit operators. Unfortunately, it seems that absolutely nobody out there has implemented a radix sort in python that uses bit operators instead of logarithms.

So, I took matters into my own hands and came up with this beast, which runs at about half the speed of sorted on small arrays and runs nearly as quickly on larger ones (e.g. len around 10,000,000):

import itertools

def radix_sort(unsorted):
    "Fast implementation of radix sort for any size num."
    maximum, minimum = max(unsorted), min(unsorted)

    max_bits = maximum.bit_length()
    highest_byte = max_bits // 8 if max_bits % 8 == 0 else (max_bits // 8) + 1

    min_bits = minimum.bit_length()
    lowest_byte = min_bits // 8 if min_bits % 8 == 0 else (min_bits // 8) + 1

    sorted_list = unsorted
    for offset in xrange(lowest_byte, highest_byte):
        sorted_list = radix_sort_offset(sorted_list, offset)

    return sorted_list

def radix_sort_offset(unsorted, offset):
    "Helper function for radix sort, sorts each offset."
    byte_check = (0xFF << offset*8)

    buckets = [[] for _ in xrange(256)]

    for num in unsorted:
        byte_at_offset = (num & byte_check) >> offset*8
        buckets[byte_at_offset].append(num)

    return list(itertools.chain.from_iterable(buckets))

This version of radix sort works by finding which bytes it has to sort by (if you pass it only integers below 256, it'll sort just one byte, etc.) then sorting each byte from LSB up by dumping them into buckets in order then just chaining the buckets together. Repeat this for each byte that needs to be sorted and you have your nice sorted array in O(n) time.

However, it's not as fast as it could be, and I'd like to make it faster before I write about it as a better radix sort than all the other radix sorts out there.

Running cProfile on this tells me that a lot of time is being spent on the append method for lists, which makes me think that this block:

    for num in unsorted:
        byte_at_offset = (num & byte_check) >> offset*8
        buckets[byte_at_offset].append(num)

in radix_sort_offset is eating a lot of time. This is also the block that, if you really look at it, does 90% of the work for the whole sort. This code looks like it could be numpy-ized, which I think would result in quite a performance boost. Unfortunately, I'm not very good with numpy's more complex features so haven't been able to figure that out. Help would be very appreciated.

I'm currently using itertools.chain.from_iterable to flatten the buckets, but if anyone has a faster suggestion I'm sure it would help as well.

Originally, I had a get_byte function that returned the nth byte of a number, but inlining the code gave me a huge speed boost so I did it.

Any other comments on the implementation or ways to squeeze out more performance are also appreciated. I want to hear anything and everything you've got.

292

asked Nov 26 '13 03:11

reem

1 Answers

You already realized that

for num in unsorted:
    byte_at_offset = (num & byte_check) >> offset*8
    buckets[byte_at_offset].append(num)

is where most of the time goes - good ;-)

There are two standard tricks for speeding that kind of thing, both having to do with moving invariants out of loops:

Compute "offset*8" outside the loop. Store it in a local variable. Save a multiplication per iteration.
Add bucketappender = [bucket.append for bucket in buckets] outside the loop. Saves a method lookup per iteration.

Combine them, and the loop looks like:

for num in unsorted:
    bucketappender[(num & byte_check) >> ofs8](num)

Collapsing it to one statement also saves a pair of local vrbl store/fetch opcodes per iteration.

But, at a higher level, the standard way to speed radix sort is to use a larger radix. What's magical about 256? Nothing, apart from that it's convenient for bit-shifting. But so are 512, 1024, 2048 ... it's a classical time/space tradeoff.

PS: for very long numbers,

(num >> offset*8) & 0xff

will run faster. That's because your num & byte_check takes time proportional to log(num) - it generally has to create an integer about as big as num.

answered Sep 23 '22 23:09

Tim Peters

Related questions
                            
                                Python, functional programming, mapping to a higher level
                            
                                Python: deepcopy does not work on user-defined classes?
                            
                                Python safe dict navigation, The Right Way
                            
                                websocket server for django application
                            
                                How to get a python script to listen for inputs from another script
                            
                                printing a variable value to 2 decimal places [duplicate]
                            
                                Getting meta tags from a page source using Selenium Python
                            
                                Python runtime_library_dirs doesn't work on Mac
                            
                                Determining the range of values returned by Python's hash()
                            
                                elasticsearch bulk indexing using python
                            
                                Should I use abstract classes?
                            
                                In numpy, how to efficiently list all fixed-size submatrices?
                            
                                Python PIL Paste
                            
                                Caching a generator
                            
                                appcfg.py command not found
                            
                                Reverting a URL in Flask to the endpoint + arguments
                            
                                Listing a filter object twice will return a blank list?
                            
                                In-place row-wise operation on pandas DataFrame
                            
                                What is the best way to send multiple HTTP requests in Python 3? [duplicate]
                            
                                How to output logging.info and logging.debug to console?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With