NumPy's string functions are all very slow and are less performant than pure Python lists. I am looking to optimize all the normal string functions using Cython.
For instance, let's take a numpy array of 100,000 unicode strings with data type either unicode or object and lowercase each one.
alist = ['JsDated', 'УКРАЇНА'] * 50000
arr_unicode = np.array(alist)
arr_object = np.array(alist, dtype='object')
%timeit np.char.lower(arr_unicode)
51.6 ms ± 1.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using a list comprehension is just as fast:
%timeit [a.lower() for a in arr_unicode]
44.7 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
For the object data type, we cannot use np.char. The list comprehension is 3x as fast.
%timeit [a.lower() for a in arr_object]
16.1 ms ± 147 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
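As a quick sanity check (my own snippet, not part of the original timings), np.char.lower rejects object arrays outright, which is why the comparison above falls back to a list comprehension:

import numpy as np

arr_object = np.array(['JsDated', 'УКРАЇНА'] * 50000, dtype='object')
try:
    np.char.lower(arr_object)
except TypeError as e:
    print(e)  # something like: string operation on non-string array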
The only way I know how to do this in Cython is to create an empty object array and call the Python string method lower on each iteration:
import numpy as np
cimport numpy as np
from numpy cimport ndarray

def lower(ndarray[object] arr):
    cdef int i
    cdef int n = len(arr)
    cdef ndarray[object] result = np.empty(n, dtype='object')
    for i in range(n):
        # still calls the Python-level str.lower on every element
        result[i] = arr[i].lower()
    return result
This yields a modest improvement:
%timeit lower(arr_object)
11.3 ms ± 383 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
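For completeness, here is a minimal setup.py I would use to build a .pyx module like the one above; the filename lower_str.pyx is my own placeholder:

# setup.py: build with `python setup.py build_ext --inplace`
import numpy as np
from setuptools import setup
from Cython.Build import cythonize

setup(
    ext_modules=cythonize('lower_str.pyx'),
    include_dirs=[np.get_include()],  # header path needed for `cimport numpy`
)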
I have tried accessing the memory directly with the data ndarray attribute like this:
def lower_fast(ndarray[object] arr):
    cdef int n = len(arr)
    cdef int i
    cdef char* data = arr.data
    cdef int itemsize = arr.itemsize
    for i in range(n):
        # no idea here
I believe data is one contiguous piece of memory holding all the raw bytes one after the other. Accessing these bytes is extremely fast, and it seems that converting these raw bytes directly could increase performance by two orders of magnitude. I found a C++ tolower function that might work, but I don't know how to hook it in with Cython.
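For what it's worth, hooking a plain C tolower into Cython only takes an extern declaration. A minimal sketch, assuming C's tolower from ctype.h (ASCII-only in the default locale) rather than the C++ overloads; lower_bytes is my own placeholder name:

from numpy cimport ndarray

cdef extern from "ctype.h":
    int tolower(int c)

def lower_bytes(ndarray arr):
    # walk the raw buffer byte by byte, like the lower_fast idea below
    cdef char *data = arr.data
    cdef int i
    cdef int n = arr.shape[0] * arr.itemsize
    for i in range(n):
        # cast to unsigned char first: passing negative values to tolower is undefined
        data[i] = <char> tolower(<unsigned char> data[i])
    return arr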
Here is the fastest method I found by far, from another SO post. This lowercases all the ASCII characters by accessing the raw bytes of the numpy array via the data attribute. I think it will mangle other unicode characters that happen to have bytes between 65 and 90 as well. But the speed is very good.
cdef void f(char *a, int itemsize, int shape):
    cdef int i
    cdef int num
    for i in range(shape * itemsize):
        num = a[i]
        # shift ASCII 'A'-'Z' (65-90) down to 'a'-'z'
        if 65 <= num <= 90:
            a[i] += 32

def lower_fast(ndarray arr):
    cdef char *inp
    inp = arr.data
    f(inp, arr.itemsize, arr.shape[0])
    return arr
This is 100x faster than the others and is what I am looking for.
%timeit lower_fast(arr_unicode)
103 µs ± 1.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
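One caveat with the byte-level trick (my note, with an untimed sketch): for a '<U' array the buffer is UCS-4, so a character like 'Ł' (U+0141) stores a 0x41 byte and would be rewritten into a different codepoint. Viewing the buffer as whole uint32 codepoints still only lowercases ASCII, but at least leaves everything else intact:

# lower_ascii_safe: sketch, assumes a contiguous native-endian '<U...' (UCS-4) array
import numpy as np
cimport numpy as np

def lower_ascii_safe(np.ndarray arr):
    # reinterpret the raw buffer as one uint32 codepoint per character
    cdef np.uint32_t[:] cp = arr.view(np.uint32)
    cdef Py_ssize_t i
    for i in range(cp.shape[0]):
        if 65 <= cp[i] <= 90:  # only true ASCII 'A'..'Z'
            cp[i] += 32
    return arr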
You can use NumPy from Cython exactly the same as in regular Python, but by doing so you lose potentially high speedups, because Cython has support for fast, typed access to NumPy arrays.
pandas provides a bunch of C- or Cython-optimized functions that can be faster than the NumPy equivalents (e.g., reading text from text files).
NumPy is fast because it can do all its calculations without calling back into Python. Since this function involves looping in Python, we lose all the performance benefits of using NumPy. For a 10,000,000-entry NumPy array, this function takes 2.5 seconds to run on my computer.
This was only slightly faster than the list comprehension for me on my machine, but if you want unicode support, this might be the fastest way of doing it. You'll need to apt-get install libunistring-dev or whatever is appropriate for your OS / package manager.
In some C file, say _lower.c, have:
#include <stdlib.h>
#include <string.h>
#include <unistr.h>
#include <unicase.h>

void _c_tolower(uint8_t **s, uint32_t total_len) {
    size_t lower_len, s_len;
    uint8_t *s_ptr = *s, *lowered;
    while (s_ptr - *s < total_len) {
        s_len = u8_strlen(s_ptr);
        if (s_len == 0) {
            /* skip the NUL padding between strings */
            s_ptr += 1;
            continue;
        }
        lowered = u8_tolower(s_ptr, s_len, NULL, NULL, NULL, &lower_len);
        memcpy(s_ptr, lowered, lower_len);
        free(lowered);
        s_ptr += s_len;
    }
}
Then, in lower.pxd, you do:
cdef extern from "_lower.c":
    cdef void _c_tolower(unsigned char **s, unsigned int total_len)
And finally, in lower.pyx:
from numpy cimport ndarray

cpdef void lower(ndarray arr):
    cdef unsigned char * _arr
    _arr = <unsigned char *> arr.data
    # hand the whole buffer (n strings x itemsize bytes) to the C helper
    _c_tolower(&_arr, arr.shape[0] * arr.itemsize)
On my laptop, I got 46 ms for the list comprehension you had above and 37 ms for this method (and 0.8 ms for your lower_fast), so it's probably not worth it, but I figured I'd type it out in case you wanted an example of how to hook such a thing into Cython.
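Building that module needs libunistring at link time; a hedged setup.py sketch (same pattern as the build script earlier, plus the library):

# setup.py: _lower.c is pulled in textually by the cdef extern block
import numpy as np
from setuptools import setup, Extension
from Cython.Build import cythonize

ext = Extension(
    'lower',
    sources=['lower.pyx'],
    libraries=['unistring'],  # link against -lunistring
    include_dirs=[np.get_include()],
)
setup(ext_modules=cythonize([ext]))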
There are a few points of improvement that I don't know will make much of a difference:
arr.data is something like a square matrix, I guess? (I don't know, I don't use numpy for anything), and it pads the ends of the shorter strings with \x00s (see the snippet below). I was too lazy to figure out how to get u8_tolower to look past the 0s, so I just manually fast-forward past them (that's what the if (s_len == 0) clause is doing). I suspect that one call to u8_tolower would be significantly faster than doing it thousands of times.
Not really an answer, but I hope it helps your further investigations!
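To make the padding concrete, here is a small illustration of my own: NumPy sizes every element of a '<U' array to the longest string and zero-pads the tail, which is what the fast-forwarding skips over:

import numpy as np

arr = np.array(['ab', 'cdef'])            # dtype '<U4': 4 codepoints per element
print(arr.dtype, arr.itemsize)            # <U4 16  (4 chars x 4 bytes each)
# 'ab' occupies the same 16 bytes, tail-padded with zero codepoints
print(arr.view(np.uint32).reshape(2, 4))  # [[ 97  98   0   0] [ 99 100 101 102]]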
PS You'll notice that this does the lowering in-place, so the usage would be like this:
>>> alist = ['JsDated', 'УКРАЇНА', '道德經', 'Ну И йЕшШо'] * 2
>>> arr_unicode = np.array(alist)
>>> lower_2(arr_unicode)
>>> for x in arr_unicode:
...     print(x)
...
jsdated
україна
道德經
ну и йешшо
jsdated
україна
道德經
ну и йешшо
>>> alist = ['JsDated', 'УКРАЇНА'] * 50000
>>> arr_unicode = np.array(alist)
>>> ct = time(); x = [a.lower() for a in arr_unicode]; time() - ct;
0.046072959899902344
>>> arr_unicode = np.array(alist)
>>> ct = time(); lower_2(arr_unicode); time() - ct
0.037489891052246094
EDIT: Duh, you can just modify the C function to look like this:
void _c_tolower(uint8_t **s, uint32_t total_len) {
    size_t lower_len;
    uint8_t *lowered;
    lowered = u8_tolower(*s, total_len, NULL, NULL, NULL, &lower_len);
    memcpy(*s, lowered, lower_len);
    free(lowered);
}
and then it does it all in one go. This looks more dangerous in terms of possibly having something from the old data left over if lower_len is shorter than the original string... in short, this code is TOTALLY EXPERIMENTAL AND FOR ILLUSTRATIVE PURPOSES ONLY, DO NOT USE THIS IN PRODUCTION, IT WILL PROBABLY BREAK.
Anyway, ~40% faster this way:
>>> alist = ['JsDated', 'УКРАЇНА'] * 50000
>>> arr_unicode = np.array(alist)
>>> ct = time(); lower_2(arr_unicode); time() - ct
0.022463043975830078