Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

String manipulation in Cython

I have code that does some very CPU-intensive string manipulations and I was looking for ways to improve performance.

(EDIT: I'm doing stuff like finding longest common substring, running lots of regular expressions which might be better expressed as state machines in c, stripping comments from HTML, stuff like that.)

I am currently looking into porting some of the code to Cython after hearing many good things about it. However, it seems that the main focus of Cython is numerical calculations and working with strings is barely documented.

Unicode might also be a big issue.

My questions are:

  1. Should I even bother with Cython for string stuff? Does anyone have experience with this type of processing in cython and can share?
  2. Am I missing something in the Cython docs? Does anyone know of a tutorial/reference/documentation about working with strings in Cython?
like image 355
itsadok Avatar asked Jun 03 '09 09:06

itsadok


2 Answers

Just for completeness, what I ended up doing is just write (some of) the string manipulation code in C.

As it turns out, it's ridiculously easy to get started writing c extensions to python. Unicode strings are just arrays of Py_UNICODE, which is an int or a short depending on the python build.

I got a x20 improvement converting code like

s = re.sub(r' +', ' ', s) 

to c. I got similar improvements with more complicated regexps, but the c code becomes crazy complex very quickly.

Overall, my throughput went up 20% after the rewrite. I am now looking for more things to rewrite...

like image 24
itsadok Avatar answered Sep 22 '22 15:09

itsadok


I voted up the 'profile it' answer, but wanted to add this: where possible the best optimisation you can make is to use Python standard libraries or built-in functions to perform the tasks you want. These are typically implemented in C and will provide performance broadly equivalent to any extension, including extensions written in Cython. If your algorithms are performing character by character loops in Python then those should be the first things to go, if possible.

But if you have algorithms that can't be reworked in terms of built-ins or other existing standard libraries, Cython seems like a reasonable approach. It just compiles pseudo-Python down to native code and is as suited to string operations as any other operation, really. But I'm not convinced you will see a great benefit from using Cython if you just hand it idiomatic Python code. The maximum benefit will come if you are able to rewrite some or all of each algorithm in C so that low-level operations are not constantly translating variables across the Python/C barrier.

Finally, Unicode - you've implied it might be 'a big issue' but haven't specified how you're using it. Cython will presumably produce C code that calls the relevant Python APIs that handle Unicode so the functionality is unlikely to be limited. However handling Unicode strings in C is non-trivial and may mean that the idea of rewriting some of your algorithms in C for better performance isn't worth the effort. A lot of classic string algorithms simply won't work on many Unicode encodings, which aren't 'strings' in the traditional sense of having 1 unit of storage per character.

like image 188
Kylotan Avatar answered Sep 20 '22 15:09

Kylotan