Sorting an array of integers with numpy's quicksort has become the bottleneck of my algorithm. Unfortunately, numpy does not have radix sort yet. Although counting sort would be a one-liner in numpy:
np.repeat(np.arange(1+x.max()), np.bincount(x))
see the accepted answer to the How can I vectorize this python count sort so it is absolutely as fast as it can be? question, the integers
in my application can run from 0
to 2**32
.
Am I stuck with quicksort?
Merge sort is more efficient and works faster than quick sort in case of larger array size or datasets.
If you've observed, the time complexity of Quicksort is O(n logn) in the best and average case scenarios and O(n^2) in the worst case. But since it has the upper hand in the average cases for most inputs, Quicksort is generally considered the “fastest” sorting algorithm.
Quicksort. Quicksort is one of the most efficient sorting algorithms, and this makes of it one of the most used as well. The first thing to do is to select a pivot number, this number will separate the data, on its left are the numbers smaller than it and the greater numbers on the right.
But because it has the best performance in the average case for most inputs, Quicksort is generally considered the “fastest” sorting algorithm.
No, you are not stuck with quicksort. You could use, for example,
integer_sort
from
Boost.Sort
or u4_sort
from usort. When sorting this array:
array(randint(0, high=1<<32, size=10**8), uint32)
I get the following results:
NumPy quicksort: 8.636 s 1.0 (baseline) Boost.Sort integer_sort: 4.327 s 2.0x speedup usort u4_sort: 2.065 s 4.2x speedup
I would not jump to conclusions based on this single experiment and use
usort
blindly. I would test with my actual data and measure what happens.
Your mileage will vary depending on your data and on your machine. The
integer_sort
in Boost.Sort has a rich set of options for tuning, see the
documentation.
Below I describe two ways to call a native C or C++ function from Python. Despite the long description, it's fairly easy to do it.
Boost.Sort
Put these lines into the spreadsort.cpp file:
#include <cinttypes>
#include "boost/sort/spreadsort/spreadsort.hpp"
using namespace boost::sort::spreadsort;
extern "C" {
void spreadsort(std::uint32_t* begin, std::size_t len) {
integer_sort(begin, begin + len);
}
}
It basically instantiates the templated integer_sort
for 32 bit
unsigned integers; the extern "C"
part ensures C linkage by disabling
name mangling.
Assuming you are using gcc and that the necessary include files of boost
are under the /tmp/boost_1_60_0
directory, you can compile it:
g++ -O3 -std=c++11 -march=native -DNDEBUG -shared -fPIC -I/tmp/boost_1_60_0 spreadsort.cpp -o spreadsort.so
The key flags are -fPIC
to generate
position-independet code
and -shared
to generate a
shared object
.so file. (Read the docs of gcc for further details.)
Then, you wrap the spreadsort()
C++ function
in Python using ctypes
:
from ctypes import cdll, c_size_t, c_uint32
from numpy import uint32
from numpy.ctypeslib import ndpointer
__all__ = ['integer_sort']
# In spreadsort.cpp: void spreadsort(std::uint32_t* begin, std::size_t len)
lib = cdll.LoadLibrary('./spreadsort.so')
sort = lib.spreadsort
sort.restype = None
sort.argtypes = [ndpointer(c_uint32, flags='C_CONTIGUOUS'), c_size_t]
def integer_sort(arr):
assert arr.dtype == uint32, 'Expected uint32, got {}'.format(arr.dtype)
sort(arr, arr.size)
Alternatively, you can use cffi:
from cffi import FFI
from numpy import uint32
__all__ = ['integer_sort']
ffi = FFI()
ffi.cdef('void spreadsort(uint32_t* begin, size_t len);')
C = ffi.dlopen('./spreadsort.so')
def integer_sort(arr):
assert arr.dtype == uint32, 'Expected uint32, got {}'.format(arr.dtype)
begin = ffi.cast('uint32_t*', arr.ctypes.data)
C.spreadsort(begin, arr.size)
At the cdll.LoadLibrary()
and ffi.dlopen()
calls I assumed that the
path to the spreadsort.so
file is ./spreadsort.so
. Alternatively,
you can write
lib = cdll.LoadLibrary('spreadsort.so')
or
C = ffi.dlopen('spreadsort.so')
if you append the path to spreadsort.so
to the LD_LIBRARY_PATH
environment
variable. See also Shared Libraries.
Usage. In both cases you simply call the above Python wrapper function integer_sort()
with your numpy array of 32 bit unsigned integers.
usort
As for u4_sort
, you can compile it as follows:
cc -DBUILDING_u4_sort -I/usr/include -I./ -I../ -I../../ -I../../../ -I../../../../ -std=c99 -fgnu89-inline -O3 -g -fPIC -shared -march=native u4_sort.c -o u4_sort.so
Issue this command in the directory where the u4_sort.c
file is located.
(Probably there is a less hackish way but I failed to figure that out. I
just looked into the deps.mk file in the usort directory to find out
the necessary compiler flags and include paths.)
Then, you can wrap the C function as follows:
from cffi import FFI
from numpy import uint32
__all__ = ['integer_sort']
ffi = FFI()
ffi.cdef('void u4_sort(unsigned* a, const long sz);')
C = ffi.dlopen('u4_sort.so')
def integer_sort(arr):
assert arr.dtype == uint32, 'Expected uint32, got {}'.format(arr.dtype)
begin = ffi.cast('unsigned*', arr.ctypes.data)
C.u4_sort(begin, arr.size)
In the above code, I assumed that the path to u4_sort.so
has been
appended to the LD_LIBRARY_PATH
environment variable.
Usage. As before with Boost.Sort, you simply call the above Python wrapper function integer_sort()
with your numpy array of 32 bit unsigned integers.
A radix sort base 256 (1 byte) can generate a matrix of counts used to determine bucket size in 1 pass of the data, then takes 4 passes to do the sort. On my system, Intel 2600K 3.4ghz, using Visual Studio release build for a C++ program, it takes about 1.15 seconds to sort 10^8 psuedo random unsigned 32 bit integers.
Looking at the u4_sort code mentioned in Ali's answer, the programs are similar, but u4_sort uses field sizes of {10,11,11}, taking 3 passes to sort data and 1 pass to copy back, while this example uses field sizes of {8,8,8,8}, taking 4 passes to sort data. The process is probably memory bandwidth limited due to the random access writes, so the optimizations in u4_sort, like macros for shift, one loop with fixed shift per field, aren't helping much. My results are better probably due to system and/or compiler differences. (Note u8_sort is for 64 bit integers).
Example code:
// a is input array, b is working array
void RadixSort(uint32_t * a, uint32_t *b, size_t count)
{
size_t mIndex[4][256] = {0}; // count / index matrix
size_t i,j,m,n;
uint32_t u;
for(i = 0; i < count; i++){ // generate histograms
u = a[i];
for(j = 0; j < 4; j++){
mIndex[j][(size_t)(u & 0xff)]++;
u >>= 8;
}
}
for(j = 0; j < 4; j++){ // convert to indices
m = 0;
for(i = 0; i < 256; i++){
n = mIndex[j][i];
mIndex[j][i] = m;
m += n;
}
}
for(j = 0; j < 4; j++){ // radix sort
for(i = 0; i < count; i++){ // sort by current lsb
u = a[i];
m = (size_t)(u>>(j<<3))&0xff;
b[mIndex[j][m]++] = u;
}
std::swap(a, b); // swap ptrs
}
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With