Cythonize list of all splits of a string

Tags:

cython

I'm trying to speed up a piece of code that generates all possible splits of a string.

splits('foo') -> [('f', 'oo'), ('fo', 'o'), ('foo', '')]

The code for this in python is very simple:

def splits(text):
    return [(text[:i + 1], text[i + 1:])
            for i in range(len(text))]

Is there a way to speed this up via cython or some other means? For context, the greater purpose of this code is to find the split of a string with the highest probability.

935

asked Apr 22 '17 00:04

1 Answers

This isn't the sort of problem that Cython tends to help with much. It's using slicing, which ends up largely the same speed as pure Python (i.e. actually pretty good).

Using a 100 character long byte string (b'0'*100) and 10000 iterations in timeit I get:

Your code as written - 0.37s
Your code as written but compiled in Cython - 0.21s
Your code with the line cdef int i and compiled in Cython - 0.20s (this is reproducably a small improvement. It's more significant with longer strings)
Your cdef int i and the parameter typed to bytes text - 0.28s (i.e. worse).

Best speed is got by using the Python C API directly (see code below) - 0.11s. I've chosen to do this mostly in Cython (but calling the API functions myself) for convenience, but you could write very similar code in C directly with a little more manual error checking. I've written this for the Python 3 API assuming you're using bytes objects (i.e. PyBytes instead of PyString) so if you're using Python 2, or Unicode and Python 3 you'll have to change it a little.

from cpython cimport *
cdef extern from "Python.h":
    # This isn't included in the cpython definitions
    # using PyObject* rather than object lets us control refcounting
    PyObject* Py_BuildValue(const char*,...) except NULL

def split(text):
   cdef Py_ssize_t l,i
   cdef char* s

   # Cython automatically checks the return value and raises an error if 
   # these fail. This provides a type-check on text
   PyBytes_AsStringAndSize(text,&s,&l)
   output = PyList_New(l)

   for i in range(l):
       # PyList_SET_ITEM steals a reference
       # the casting is necessary to ensure that Cython doesn't
       # decref the result of Py_BuildValue
       PyList_SET_ITEM(output,i,
                       <object>Py_BuildValue('y#y#',s,i+1,s+i+1,l-(i+1)))
   return output

If you don't want to go all the way with using the C API then a version that preallocates the list output = [None]*len(text) and does a for-loop rather than a list comprehension is marginally more efficient than your original version - 0.18s

In summary, just compiling it in Cython gives you a decent speed up (a bit less than 2x) and setting the type of i helps a little. This is all you can really achieve with Cython conventionally. To get full speed you basically need to resort to using the Python C API directly. That gets you a little under a 4x speed up which I think is pretty decent.

104

answered Sep 19 '22 13:09

DavidW

Related questions
                            
                                Plotly domain variable explained (Multiple graphs)
                            
                                move from pandas to dask to utilize all local cpu cores
                            
                                Convert bool values to string in json.dumps()
                            
                                Using filters with list_routes in drf
                            
                                Change font size without messing with Tkinter button size
                            
                                OpenCV remap interpolation error?
                            
                                Understanding ZMQ's HWM
                            
                                Use Custom Filter with Django ModelChoice Filter
                            
                                Python and Yahoo finance weird "YQLQueryError(response['error']['description'])" get_historical
                            
                                Keras merge layer warning
                            
                                Python format default rounding when formatting float number
                            
                                How to add additional variable to pytest html report
                            
                                PyQt: Why does a QButtonGroup in a method need 'self.' prefix to signal buttonClicked?
                            
                                Why does python's re.search method hang?
                            
                                Extracting text from a PDF - All pages and Output - file using Python
                            
                                Python ldap macOS - ValueError: option error
                            
                                SQL Alchemy : Table primary_key - unexpected keyword argument error
                            
                                Build a universal wheel from setup.py
                            
                                Python capture subprocess output after termination
                            
                                Pandas: how to use between_time with milliseconds?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Cythonize list of all splits of a string

Tags:

python

cython

Luke

People also ask

1 Answers

DavidW

Recent Activity

Donate For Us