Fastest way to lowercase a numpy array of unicode strings in Cython

NumPy's string functions are all very slow and perform worse than pure Python lists. I am looking to optimize all the normal string functions using Cython.

For instance, let's take a numpy array of 100,000 unicode strings, with data type either unicode or object, and lowercase each one.

import numpy as np

alist = ['JsDated', 'УКРАЇНА'] * 50000
arr_unicode = np.array(alist)
arr_object = np.array(alist, dtype='object')
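
For reference, numpy stores unicode elements as fixed-width UCS4, four bytes per code point, which is what makes the raw-byte tricks below possible. On my machine:

>>> arr_unicode.dtype, arr_unicode.itemsize
(dtype('<U7'), 28)

Both strings are seven code points long, hence the 28-byte itemsize.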

%timeit np.char.lower(arr_unicode)
51.6 ms ± 1.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Using a list comprehension is just as fast

%timeit [a.lower() for a in arr_unicode]
44.7 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

For the object data type, we cannot use np.char. Here the list comprehension is roughly 3x as fast as np.char.lower was on the unicode array.

%timeit [a.lower() for a in arr_object]
16.1 ms ± 147 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The only way I know how to do this in Cython is to create an empty object array and call the Python string method lower on each iteration.

import numpy as np
cimport numpy as np
from numpy cimport ndarray

def lower(ndarray[object] arr):
    cdef int i
    cdef int n = len(arr)
    cdef ndarray[object] result = np.empty(n, dtype='object')
    for i in range(n):
        result[i] = arr[i].lower()
    return result

This yields a modest improvement

%timeit lower(arr_object)
11.3 ms ± 383 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
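
To actually compile and try a snippet like this, one quick route is pyximport; a minimal sketch, assuming the function above is saved as lower_cy.pyx (the module name is my own invention):

import numpy as np
import pyximport
# Pass numpy's headers so "cimport numpy" resolves at compile time
pyximport.install(setup_args={"include_dirs": np.get_include()})

from lower_cy import lower   # builds lower_cy.pyx on first import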

I have tried accessing the memory directly with the data ndarray attribute like this:

def lower_fast(ndarray[object] arr):
    cdef int n = len(arr)
    cdef int i
    cdef char* data = arr.data
    cdef int itemsize = arr.itemsize
    for i in range(n):
        # no idea here

I believe data is one contiguous piece of memory holding all the raw bytes one after the other. Accessing these bytes is extremely fast, and it seems operating on them directly could increase performance by two orders of magnitude. I found a C++ tolower function that might work, but I don't know how to hook it in with Cython.
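
For what it's worth, hooking the plain C tolower (from ctype.h; the C++ one in <cctype> is the same function) into Cython only takes an extern block. A minimal sketch, ASCII-only; it walks every raw byte of the UCS4 buffer, so treat it as an illustration of the wiring rather than a unicode-correct solution:

from numpy cimport ndarray

cdef extern from "ctype.h":
    int tolower(int c)   # standard C tolower; defined for unsigned char values and EOF

def lower_ascii_bytes(ndarray arr):
    cdef char *data = arr.data
    cdef Py_ssize_t i, total = arr.shape[0] * arr.itemsize
    for i in range(total):
        # go through unsigned char so high bytes are not passed as negative ints
        data[i] = <char> tolower(<unsigned char> data[i])
    return arr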

Update with fastest method (doesn't work for unicode)

Here is the fastest method I found by far, from another SO post. This lowercases all the ASCII characters by accessing the raw memory of the numpy array via the data attribute. I think it will also mangle any non-ASCII characters whose UCS4 encoding contains a byte between 65 and 90. But the speed is very good.

cdef void f(char *a, int itemsize, int shape):
    # Walk every raw byte and shift ASCII 'A'-'Z' (65-90) down to 'a'-'z'
    cdef int i
    cdef int num
    for i in range(shape * itemsize):
        num = a[i]
        if 65 <= num <= 90:
            a[i] += 32

def lower_fast(ndarray arr):
    cdef char *inp
    inp = arr.data            # raw buffer of the whole array
    f(inp, arr.itemsize, arr.shape[0])
    return arr                # modified in place

This is 100x faster than the others and what I am looking for.

%timeit lower_fast(arr_unicode)
103 µs ± 1.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
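
Building on that, here is a hedged sketch of a variant that should not mangle non-ASCII data: since the buffer is UCS4, reinterpret it as 32-bit integers and only shift genuine 'A'-'Z' code points (assumes a native-endian unicode dtype; the function names are my own):

cimport numpy as np
from numpy cimport ndarray

cdef void g(np.uint32_t *a, Py_ssize_t n):
    cdef Py_ssize_t i
    for i in range(n):
        # only true 'A'-'Z' code points are shifted; everything else passes through
        if 65 <= a[i] <= 90:
            a[i] += 32

def lower_fast_ucs4(ndarray arr):
    # itemsize // 4 == number of UCS4 code points per element
    g(<np.uint32_t *> arr.data, arr.shape[0] * (arr.itemsize // 4))
    return arr

This still only lowercases ASCII letters, of course; uppercase Cyrillic or accented characters pass through unchanged, but at least they are not corrupted.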

1 Answer

On my machine this was only slightly faster than the list comprehension, but if you want unicode support it might be the fastest way of doing it. You'll need to apt-get install libunistring-dev or whatever is appropriate for your OS / package manager.

In some C file, say, _lower.c, have

#include <stdlib.h>
#include <string.h>
#include <unistr.h>
#include <unicase.h>

/* Lowercase a buffer of NUL-padded UTF-8 strings in place,
   walking it string by string and skipping the \x00 padding. */
void _c_tolower(uint8_t **s, uint32_t total_len) {
    size_t lower_len, s_len;
    uint8_t *s_ptr = *s, *lowered;
    while(s_ptr - *s < total_len) {
        s_len = u8_strlen(s_ptr);
        if (s_len == 0) {       /* padding byte: step over it */
            s_ptr += 1;
            continue;
        }
        /* u8_tolower allocates a fresh result; copy it back and free it */
        lowered = u8_tolower(s_ptr, s_len, NULL, NULL, NULL, &lower_len);
        memcpy(s_ptr, lowered, lower_len);
        free(lowered);
        s_ptr += s_len;
    }
}

Then, in lower.pxd you do

cdef extern from "_lower.c":
    cdef void _c_tolower(unsigned char **s, unsigned int total_len)

And finally, in lower.pyx:

from numpy cimport ndarray

cpdef void lower(ndarray arr):
    cdef unsigned char *_arr
    _arr = <unsigned char *> arr.data
    _c_tolower(&_arr, arr.shape[0] * arr.itemsize)
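
Building this needs a link against libunistring. A minimal setup.py sketch (the file names match the above; everything else is an assumption). Note that _lower.c must not be listed in sources, since the cdef extern from "_lower.c" line above textually includes it:

from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy as np

ext = Extension(
    "lower",
    sources=["lower.pyx"],            # _lower.c is #included, not compiled separately
    libraries=["unistring"],          # -> links with -lunistring
    include_dirs=[np.get_include()],  # for "from numpy cimport ndarray"
)

setup(ext_modules=cythonize([ext]))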

On my laptop, I got 46ms for the list comprehension you had above and 37ms for this method (and 0.8ms for your lower_fast), so it's probably not worth it, but I figured I'd type it out in case you wanted an example of how to hook such a thing into Cython.

There are a few possible points of improvement, though I don't know whether they will make much of a difference:

  • arr.data is something like a square matrix I guess? (I don't know, I don't use numpy for anything), and pads the ends of the shorter strings with \x00s. I was too lazy to figure out how to get u8_tolower to look past the 0s, so I just manually fast-forward past them (that's what the if (s_len == 0) clause is doing). I suspect that one call to u8_tolower would be significantly faster than doing it thousands of times.
  • I'm doing a lot of freeing/memcpying; you can probably avoid it if you're clever (one possible way is sketched after this list).
  • I think it's the case that every lowercase unicode character is at most as wide as its uppercase variant, so this should not run into any segfaults or buffer overwrites or just overlapping substring issues, but don't take my word for that.
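
On the freeing/memcpying point: libunistring's u8_tolower takes a resultbuf argument (the fifth parameter, NULL above). If you pass a caller-owned scratch buffer and its capacity in *lengthp, the result is written there whenever it fits, and malloc is only used as a fallback. A sketch of that variant (the scratch size is an arbitrary assumption):

void _c_tolower_scratch(uint8_t **s, uint32_t total_len) {
    size_t lower_len, s_len;
    uint8_t scratch[256], *lowered;   /* reused for every string */
    uint8_t *s_ptr = *s;
    while(s_ptr - *s < total_len) {
        s_len = u8_strlen(s_ptr);
        if (s_len == 0) { s_ptr += 1; continue; }
        lower_len = sizeof(scratch);  /* on entry: capacity of scratch */
        lowered = u8_tolower(s_ptr, s_len, NULL, NULL, scratch, &lower_len);
        memcpy(s_ptr, lowered, lower_len);
        if (lowered != scratch)       /* only free the malloc fallback */
            free(lowered);
        s_ptr += s_len;
    }
}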

Not really an answer, but hope it helps your further investigations!

PS You'll notice that this does the lowering in-place, so the usage would be like this:

>>> alist = ['JsDated', 'УКРАЇНА', '道德經', 'Ну И йЕшШо'] * 2
>>> arr_unicode = np.array(alist)
>>> lower(arr_unicode)
>>> for x in arr_unicode:
...     print(x)
...
...
jsdated
україна
道德經
ну и йешшо
jsdated
україна
道德經
ну и йешшо

>>> from time import time
>>> alist = ['JsDated', 'УКРАЇНА'] * 50000
>>> arr_unicode = np.array(alist)
>>> ct = time(); x = [a.lower() for a in arr_unicode]; time() - ct
0.046072959899902344
>>> arr_unicode = np.array(alist)
>>> ct = time(); lower(arr_unicode); time() - ct
0.037489891052246094

EDIT

DUH, you can just modify the C function to look like this:

/* One u8_tolower call over the entire buffer instead of one per string. */
void _c_tolower(uint8_t **s, uint32_t total_len) {
    size_t lower_len;
    uint8_t *lowered;

    lowered = u8_tolower(*s, total_len, NULL, NULL, NULL, &lower_len);
    memcpy(*s, lowered, lower_len);
    free(lowered);
}

and then it does it all in one go. It looks more dangerous in terms of possibly leaving something from the old data behind if lower_len is shorter than the original string... in short, this code is TOTALLY EXPERIMENTAL AND FOR ILLUSTRATIVE PURPOSES ONLY, DO NOT USE THIS IN PRODUCTION, IT WILL PROBABLY BREAK.

Anyway, ~40% faster this way:

>>> alist = ['JsDated', 'УКРАЇНА'] * 50000
>>> arr_unicode = np.array(alist)
>>> ct = time(); lower(arr_unicode); time() - ct
0.022463043975830078