NumPy's string functions are all very slow and are less performant than pure Python lists. I am looking to optimize all the normal string functions using Cython.
For instance, let's take a numpy array of 100,000 unicode strings with data type either unicode or object and lowercase each one.
alist = ['JsDated', 'УКРАЇНА'] * 50000
arr_unicode = np.array(alist)
arr_object = np.array(alist, dtype='object')
%timeit np.char.lower(arr_unicode)
51.6 ms ± 1.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using a list comprehension is just as fast:
%timeit [a.lower() for a in arr_unicode]
44.7 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
For the object data type, we cannot use np.char. The list comprehension is 3x as fast.
%timeit [a.lower() for a in arr_object]
16.1 ms ± 147 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
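As a quick sanity check (my own snippet, not part of the original timings), np.char.lower rejects object arrays outright, which is why the comparison above falls back to a list comprehension:

import numpy as np

arr_object = np.array(['JsDated', 'УКРАЇНА'] * 50000, dtype='object')
try:
    np.char.lower(arr_object)
except TypeError as e:
    print(e)  # something like: string operation on non-string array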
The only way I know how to do this in Cython is to create an empty object array and call the Python string method lower on each iteration:
import numpy as np
cimport numpy as np
from numpy cimport ndarray

def lower(ndarray[object] arr):
    cdef int i
    cdef int n = len(arr)
    cdef ndarray[object] result = np.empty(n, dtype='object')
    for i in range(n):
        # still calls the Python-level str.lower on every element
        result[i] = arr[i].lower()
    return result
This yields a modest improvement:
%timeit lower(arr_object)
11.3 ms ± 383 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
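For completeness, here is a minimal setup.py I would use to build a .pyx module like the one above; the filename lower_str.pyx is my own placeholder:

# setup.py: build with `python setup.py build_ext --inplace`
import numpy as np
from setuptools import setup
from Cython.Build import cythonize

setup(
    ext_modules=cythonize('lower_str.pyx'),
    include_dirs=[np.get_include()],  # header path needed for `cimport numpy`
)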
I have tried accessing the memory directly with the data ndarray attribute like this:
def lower_fast(ndarray[object] arr):
    cdef int n = len(arr)
    cdef int i
    cdef char* data = arr.data
    cdef int itemsize = arr.itemsize
    for i in range(n):
        # no idea here
I believe data is one contiguous piece of memory holding all the raw bytes one after the other. Accessing these bytes is extremely fast, and it seems that converting these raw bytes directly could increase performance by two orders of magnitude. I found a C++ tolower function that might work, but I don't know how to hook it in with Cython.
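For what it's worth, hooking a plain C tolower into Cython only takes an extern declaration. A minimal sketch, assuming C's tolower from ctype.h (ASCII-only in the default locale) rather than the C++ overloads; lower_bytes is my own placeholder name:

from numpy cimport ndarray

cdef extern from "ctype.h":
    int tolower(int c)

def lower_bytes(ndarray arr):
    # walk the raw buffer byte by byte, like the lower_fast idea below
    cdef char *data = arr.data
    cdef int i
    cdef int n = arr.shape[0] * arr.itemsize
    for i in range(n):
        # cast to unsigned char first: passing negative values to tolower is undefined
        data[i] = <char> tolower(<unsigned char> data[i])
    return arr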
Here is the fastest method I found by far, from another SO post. This lowercases all the ASCII characters by accessing the raw bytes of the numpy array via the data attribute. I think it will mangle other unicode characters that happen to have bytes between 65 and 90 as well. But the speed is very good.
cdef void f(char *a, int itemsize, int shape):
    cdef int i
    cdef int num
    for i in range(shape * itemsize):
        num = a[i]
        # shift ASCII 'A'-'Z' (65-90) down to 'a'-'z'
        if 65 <= num <= 90:
            a[i] += 32

def lower_fast(ndarray arr):
    cdef char *inp
    inp = arr.data
    f(inp, arr.itemsize, arr.shape[0])
    return arr
This is 100x faster than the others and is what I am looking for.
%timeit lower_fast(arr_unicode)
103 µs ± 1.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
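One caveat with the byte-level trick (my note, with an untimed sketch): for a '<U' array the buffer is UCS-4, so a character like 'Ł' (U+0141) stores a 0x41 byte and would be rewritten into a different codepoint. Viewing the buffer as whole uint32 codepoints still only lowercases ASCII, but at least leaves everything else intact:

# lower_ascii_safe: sketch, assumes a contiguous native-endian '<U...' (UCS-4) array
import numpy as np
cimport numpy as np

def lower_ascii_safe(np.ndarray arr):
    # reinterpret the raw buffer as one uint32 codepoint per character
    cdef np.uint32_t[:] cp = arr.view(np.uint32)
    cdef Py_ssize_t i
    for i in range(cp.shape[0]):
        if 65 <= cp[i] <= 90:  # only true ASCII 'A'..'Z'
            cp[i] += 32
    return arr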
You can use NumPy from Cython exactly the same as in regular Python, but by doing so you lose potentially high speedups, because Cython has support for fast, typed access to NumPy arrays.
pandas provides a bunch of C- or Cython-optimized functions that can be faster than the NumPy equivalents (e.g., reading text from text files).
NumPy is fast because it can do all its calculations without calling back into Python. Since this function involves looping in Python, we lose all the performance benefits of using NumPy. For a 10,000,000-entry NumPy array, this function takes 2.5 seconds to run on my computer.
This was only slightly faster than the list comprehension for me on my machine, but if you want unicode support, this might be the fastest way of doing it. You'll need to apt-get install libunistring-dev or whatever is appropriate for your OS / package manager.
In some C file, say _lower.c, have:
#include <stdlib.h>
#include <string.h>
#include <unistr.h>
#include <unicase.h>

void _c_tolower(uint8_t **s, uint32_t total_len) {
    size_t lower_len, s_len;
    uint8_t *s_ptr = *s, *lowered;
    while (s_ptr - *s < total_len) {
        s_len = u8_strlen(s_ptr);
        if (s_len == 0) {
            /* skip the NUL padding between strings */
            s_ptr += 1;
            continue;
        }
        lowered = u8_tolower(s_ptr, s_len, NULL, NULL, NULL, &lower_len);
        memcpy(s_ptr, lowered, lower_len);
        free(lowered);
        s_ptr += s_len;
    }
}
Then, in lower.pxd, you do:
cdef extern from "_lower.c":
    cdef void _c_tolower(unsigned char **s, unsigned int total_len)
And finally, in lower.pyx:
from numpy cimport ndarray

cpdef void lower(ndarray arr):
    cdef unsigned char * _arr
    _arr = <unsigned char *> arr.data
    # hand the whole buffer (n strings x itemsize bytes) to the C helper
    _c_tolower(&_arr, arr.shape[0] * arr.itemsize)
On my laptop, I got 46 ms for the list comprehension you had above and 37 ms for this method (and 0.8 ms for your lower_fast), so it's probably not worth it, but I figured I'd type it out in case you wanted an example of how to hook such a thing into Cython.
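Building that module needs libunistring at link time; a hedged setup.py sketch (same pattern as the build script earlier, plus the library):

# setup.py: _lower.c is pulled in textually by the cdef extern block
import numpy as np
from setuptools import setup, Extension
from Cython.Build import cythonize

ext = Extension(
    'lower',
    sources=['lower.pyx'],
    libraries=['unistring'],  # link against -lunistring
    include_dirs=[np.get_include()],
)
setup(ext_modules=cythonize([ext]))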
There are a few points of improvement that I don't know will make much of a difference:
arr.data is something like a square matrix, I guess? (I don't know, I don't use numpy for anything), and it pads the ends of the shorter strings with \x00s (see the snippet below). I was too lazy to figure out how to get u8_tolower to look past the 0s, so I just manually fast-forward past them (that's what the if (s_len == 0) clause is doing). I suspect that one call to u8_tolower would be significantly faster than doing it thousands of times.
Not really an answer, but I hope it helps your further investigations!
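To make the padding concrete, here is a small illustration of my own: NumPy sizes every element of a '<U' array to the longest string and zero-pads the tail, which is what the fast-forwarding skips over:

import numpy as np

arr = np.array(['ab', 'cdef'])            # dtype '<U4': 4 codepoints per element
print(arr.dtype, arr.itemsize)            # <U4 16  (4 chars x 4 bytes each)
# 'ab' occupies the same 16 bytes, tail-padded with zero codepoints
print(arr.view(np.uint32).reshape(2, 4))  # [[ 97  98   0   0] [ 99 100 101 102]]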
PS You'll notice that this does the lowering in-place, so the usage would be like this:
>>> alist = ['JsDated', 'УКРАЇНА', '道德經', 'Ну И йЕшШо'] * 2
>>> arr_unicode = np.array(alist)
>>> lower_2(arr_unicode)
>>> for x in arr_unicode:
...     print(x)
...
jsdated
україна
道德經
ну и йешшо
jsdated
україна
道德經
ну и йешшо
>>> alist = ['JsDated', 'УКРАЇНА'] * 50000
>>> arr_unicode = np.array(alist)
>>> ct = time(); x = [a.lower() for a in arr_unicode]; time() - ct;
0.046072959899902344
>>> arr_unicode = np.array(alist)
>>> ct = time(); lower_2(arr_unicode); time() - ct
0.037489891052246094
EDIT: Duh, you can just modify the C function to look like this:
void _c_tolower(uint8_t **s, uint32_t total_len) {
    size_t lower_len;
    uint8_t *lowered;
    lowered = u8_tolower(*s, total_len, NULL, NULL, NULL, &lower_len);
    memcpy(*s, lowered, lower_len);
    free(lowered);
}
and then it does it all in one go. This looks more dangerous in terms of possibly having something from the old data left over if lower_len is shorter than the original string... in short, this code is TOTALLY EXPERIMENTAL AND FOR ILLUSTRATIVE PURPOSES ONLY, DO NOT USE THIS IN PRODUCTION, IT WILL PROBABLY BREAK.
Anyway, ~40% faster this way:
>>> alist = ['JsDated', 'УКРАЇНА'] * 50000
>>> arr_unicode = np.array(alist)
>>> ct = time(); lower_2(arr_unicode); time() - ct
0.022463043975830078