Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Three-way comparing strings in Python 3

Say you want to optimize a (byte) string compare intensive algorithm implemented in Python. Since a central code path contains this sequence of statements

if s < t:
    # less than ...
elif t < s:
    # greater than ...
else:
    # equal ...

it would be great to optimize it to something like

r = bytes_compare(s, t)
if r < 0:
    # less than ...
elif r > 0:
    # greater than ...
else:
    # equal ...

where (the hypothetical) bytes_compare() ideally would just call the three-way comparison C function memcmp() which is usually quite well optimized. This would reduce the number of string comparisons by half. A very feasible optimization unless the strings are ultra short.

But how to get there with Python 3?

PS:

Python 3 has removed the three way comparison global function cmp() and the magic method __cmp__(). And even with Python 2, the bytes class doesn't had a __cmp__() member.

With the ctypes package it's straight forward to call memcmp() but the foreign function call overhead with ctypes is prohibitively high.

like image 786
maxschlepzig Avatar asked Jun 10 '18 09:06

maxschlepzig


People also ask

How do you compare three strings in Python?

Python comparison operators can be used to compare strings in Python. These operators are: equal to ( == ), not equal to ( != ), greater than ( > ), less than ( < ), less than or equal to ( <= ), and greater than or equal to ( >= ).

Can you use == for strings in Python?

Python String comparison can be performed using equality (==) and comparison (<, >, != , <=, >=) operators.

What is == comparing in Python?

The == operator compares the value or equality of two objects, whereas the Python is operator checks whether two variables point to the same object in memory. In the vast majority of cases, this means you should use the equality operators == and !=


1 Answers

Python 3 (including 3.6) simply doesn't include any three-way comparison support for strings. Although the internal implementation of the rich comparison operator __lt__(), __eq__() etc. do call memcmp() (in the C implementation of bytes - cf. Objects/bytesobject.c) there is no internal three-way comparison function that could be leveraged.

Thus, writing a C extension that provides a three-way comparison function by calling memcmp() is the next best thing:

#include <Python.h>
static PyObject* cmp(PyObject* self, PyObject* args) {
    PyObject *a = 0, *b = 0;
    if (!PyArg_UnpackTuple(args, "cmp", 2, 2, &a, &b))
        return 0;
    if (!PyBytes_Check(a) || !PyBytes_Check(b)) {
        PyErr_SetString(PyExc_TypeError, "only bytes() strings supported");
        return 0;
    }
    Py_ssize_t n = PyBytes_GET_SIZE(a), m = PyBytes_GET_SIZE(b);
    char *s = PyBytes_AsString(a), *t = PyBytes_AsString(b);
    int r = 0;
    if (n == m) {
        r = memcmp(s, t, n);
    } else if (n < m) {
        r = memcmp(s, t, n);
        if (!r)
            r = -1;
    } else {
        r = memcmp(s, t, m);
        if (!r)
            r = 1;
    }
    return PyLong_FromLong(r);
}
static PyMethodDef bytes_util_methods[] = {
    { "cmp", cmp, METH_VARARGS, "Three way compare 2 bytes() objects." },
    {0,0,0,0} };
static struct PyModuleDef bytes_util_def = {
    PyModuleDef_HEAD_INIT, "bytes_util", "Three way comparison for strings.",
    -1, bytes_util_methods };
PyMODINIT_FUNC PyInit_bytes_util(void) {
    Py_Initialize();
    return PyModule_Create(&bytes_util_def);
}

Compile with:

gcc -Wall -O3 -fPIC -shared bytes_util.c -o bytes_util.so -I/usr/include/python3.6m

Test:

>>> import bytes_util
>>> bytes_util.cmp(b'foo', b'barx')
265725

In contrast to calling memcmp via the ctypes package, this foreign call has the same overhead as the builtin bytes comparison operators (as they also are implemented as C extension with the standard Python version).

like image 169
maxschlepzig Avatar answered Oct 05 '22 19:10

maxschlepzig