How to sort an array of integers faster than quicksort?

Tags:

Sorting an array of integers with numpy's quicksort has become the bottleneck of my algorithm. Unfortunately, numpy does not have radix sort yet. Although counting sort would be a one-liner in numpy:

np.repeat(np.arange(1+x.max()), np.bincount(x))

see the accepted answer to the How can I vectorize this python count sort so it is absolutely as fast as it can be? question, the integers in my application can run from 0 to 2**32.

Am I stuck with quicksort?

_{This post was primarily motivated by the
[Numpy grouping using itertools.groupby performance](https://stackoverflow.com/q/4651683/341970)
question.
Also note that
[it is not merely OK to ask and answer your own question, it is explicitly encouraged.](https://blog.stackoverflow.com/2011/07/its-ok-to-ask-and-answer-your-own-questions/)}

808

asked Feb 10 '16 14:02

Ali

2 Answers

No, you are not stuck with quicksort. You could use, for example, integer_sort from Boost.Sort or u4_sort from usort. When sorting this array:

array(randint(0, high=1<<32, size=10**8), uint32)

I get the following results:

NumPy quicksort:         8.636 s  1.0  (baseline)
Boost.Sort integer_sort: 4.327 s  2.0x speedup
usort u4_sort:           2.065 s  4.2x speedup

I would not jump to conclusions based on this single experiment and use usort blindly. I would test with my actual data and measure what happens. Your mileage will vary depending on your data and on your machine. The integer_sort in Boost.Sort has a rich set of options for tuning, see the documentation.

Below I describe two ways to call a native C or C++ function from Python. Despite the long description, it's fairly easy to do it.

Boost.Sort

Put these lines into the spreadsort.cpp file:

#include <cinttypes>
#include "boost/sort/spreadsort/spreadsort.hpp"
using namespace boost::sort::spreadsort;

extern "C" {
    void spreadsort(std::uint32_t* begin,  std::size_t len) {
        integer_sort(begin, begin + len);
    }
}

It basically instantiates the templated integer_sort for 32 bit unsigned integers; the extern "C" part ensures C linkage by disabling name mangling. Assuming you are using gcc and that the necessary include files of boost are under the /tmp/boost_1_60_0 directory, you can compile it:

g++ -O3 -std=c++11 -march=native -DNDEBUG -shared -fPIC -I/tmp/boost_1_60_0 spreadsort.cpp -o spreadsort.so

The key flags are -fPIC to generate position-independet code and -shared to generate a shared object .so file. (Read the docs of gcc for further details.)

Then, you wrap the spreadsort() C++ function in Python using ctypes:

from ctypes import cdll, c_size_t, c_uint32
from numpy import uint32
from numpy.ctypeslib import ndpointer

__all__ = ['integer_sort']

# In spreadsort.cpp: void spreadsort(std::uint32_t* begin,  std::size_t len)
lib = cdll.LoadLibrary('./spreadsort.so')
sort = lib.spreadsort
sort.restype = None
sort.argtypes = [ndpointer(c_uint32, flags='C_CONTIGUOUS'), c_size_t]

def integer_sort(arr):
    assert arr.dtype == uint32, 'Expected uint32, got {}'.format(arr.dtype)
    sort(arr, arr.size)

Alternatively, you can use cffi:

from cffi import FFI
from numpy import uint32

__all__ = ['integer_sort']

ffi = FFI()
ffi.cdef('void spreadsort(uint32_t* begin,  size_t len);')
C = ffi.dlopen('./spreadsort.so')

def integer_sort(arr):
    assert arr.dtype == uint32, 'Expected uint32, got {}'.format(arr.dtype)
    begin = ffi.cast('uint32_t*', arr.ctypes.data)
    C.spreadsort(begin, arr.size)

At the cdll.LoadLibrary() and ffi.dlopen() calls I assumed that the path to the spreadsort.so file is ./spreadsort.so. Alternatively, you can write

lib = cdll.LoadLibrary('spreadsort.so')

C = ffi.dlopen('spreadsort.so')

if you append the path to spreadsort.so to the LD_LIBRARY_PATH environment variable. See also Shared Libraries.

Usage. In both cases you simply call the above Python wrapper function integer_sort() with your numpy array of 32 bit unsigned integers.

usort

As for u4_sort, you can compile it as follows:

cc -DBUILDING_u4_sort -I/usr/include -I./ -I../ -I../../ -I../../../ -I../../../../ -std=c99 -fgnu89-inline -O3 -g -fPIC -shared -march=native u4_sort.c -o u4_sort.so

Issue this command in the directory where the u4_sort.c file is located. (Probably there is a less hackish way but I failed to figure that out. I just looked into the deps.mk file in the usort directory to find out the necessary compiler flags and include paths.)

Then, you can wrap the C function as follows:

from cffi import FFI
from numpy import uint32

__all__ = ['integer_sort']

ffi = FFI()
ffi.cdef('void u4_sort(unsigned* a, const long sz);')
C = ffi.dlopen('u4_sort.so')

def integer_sort(arr):
    assert arr.dtype == uint32, 'Expected uint32, got {}'.format(arr.dtype)
    begin = ffi.cast('unsigned*', arr.ctypes.data)
    C.u4_sort(begin, arr.size)

In the above code, I assumed that the path to u4_sort.so has been appended to the LD_LIBRARY_PATH environment variable.

Usage. As before with Boost.Sort, you simply call the above Python wrapper function integer_sort() with your numpy array of 32 bit unsigned integers.

127

answered Oct 06 '22 06:10

Ali

A radix sort base 256 (1 byte) can generate a matrix of counts used to determine bucket size in 1 pass of the data, then takes 4 passes to do the sort. On my system, Intel 2600K 3.4ghz, using Visual Studio release build for a C++ program, it takes about 1.15 seconds to sort 10^8 psuedo random unsigned 32 bit integers.

Looking at the u4_sort code mentioned in Ali's answer, the programs are similar, but u4_sort uses field sizes of {10,11,11}, taking 3 passes to sort data and 1 pass to copy back, while this example uses field sizes of {8,8,8,8}, taking 4 passes to sort data. The process is probably memory bandwidth limited due to the random access writes, so the optimizations in u4_sort, like macros for shift, one loop with fixed shift per field, aren't helping much. My results are better probably due to system and/or compiler differences. (Note u8_sort is for 64 bit integers).

Example code:

//  a is input array, b is working array
void RadixSort(uint32_t * a, uint32_t *b, size_t count)
{
size_t mIndex[4][256] = {0};            // count / index matrix
size_t i,j,m,n;
uint32_t u;
    for(i = 0; i < count; i++){         // generate histograms
        u = a[i];
        for(j = 0; j < 4; j++){
            mIndex[j][(size_t)(u & 0xff)]++;
            u >>= 8;
        }       
    }
    for(j = 0; j < 4; j++){             // convert to indices
        m = 0;
        for(i = 0; i < 256; i++){
            n = mIndex[j][i];
            mIndex[j][i] = m;
            m += n;
        }       
    }
    for(j = 0; j < 4; j++){             // radix sort
        for(i = 0; i < count; i++){     //  sort by current lsb
            u = a[i];
            m = (size_t)(u>>(j<<3))&0xff;
            b[mIndex[j][m]++] = u;
        }
        std::swap(a, b);                //  swap ptrs
    }
}

answered Oct 06 '22 06:10

rcgldr

Related questions
                            
                                How can I export an instance and all its related objects in Django?
                            
                                Pass Python list to embedded Rust function
                            
                                Alternative to `any` that returns the last evaluated object?
                            
                                Selenium pdf automatic download not working
                            
                                Flask SqlAlchemy join two models without foreign key MYSQL
                            
                                How to estimate density function and calculate its peaks?
                            
                                Python watchdog windows wait till copy finishes
                            
                                Get the length of reversed list
                            
                                Assign line colors in pandas
                            
                                How can I check if one two-dimensional NumPy array contains a specific pattern of values inside it?
                            
                                How to serialize custom user model in DRF
                            
                                Get cell color from .xlsx
                            
                                Dot product with dictionaries
                            
                                Addition of list and NumPy number
                            
                                How to use nltk regex pattern to extract a specific phrase chunk?
                            
                                Python Pandas - Read csv file containing multiple tables
                            
                                How to check if all elements in a tuple or list are in another?
                            
                                Convert an Array, converted to a String, back to an Array
                            
                                Merge lists in Python by placing every nth item from one list and others from another?
                            
                                Django backup strategy with dumpdata and migrations

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to sort an array of integers faster than quicksort?

Tags:

performance

python

algorithm

sorting

numpy

Ali

People also ask

2 Answers

Ali

rcgldr

Recent Activity

Donate For Us