Say you want to optimize a (byte) string compare intensive algorithm implemented in Python. Since a central code path contains this sequence of statements
if s < t:
# less than ...
elif t < s:
# greater than ...
else:
# equal ...
it would be great to optimize it to something like
r = bytes_compare(s, t)
if r < 0:
# less than ...
elif r > 0:
# greater than ...
else:
# equal ...
where (the hypothetical) bytes_compare()
ideally would just call the three-way comparison C function memcmp()
which is usually quite well optimized. This would reduce the number of string comparisons by half. A very feasible optimization unless the strings are ultra short.
But how to get there with Python 3?
PS:
Python 3 has removed the three way comparison global function cmp()
and the magic method __cmp__()
. And even with Python 2, the bytes
class doesn't had a __cmp__()
member.
With the ctypes
package it's straight forward to call memcmp()
but the foreign function call overhead with ctypes
is prohibitively high.
Python comparison operators can be used to compare strings in Python. These operators are: equal to ( == ), not equal to ( != ), greater than ( > ), less than ( < ), less than or equal to ( <= ), and greater than or equal to ( >= ).
Python String comparison can be performed using equality (==) and comparison (<, >, != , <=, >=) operators.
The == operator compares the value or equality of two objects, whereas the Python is operator checks whether two variables point to the same object in memory. In the vast majority of cases, this means you should use the equality operators == and !=
Python 3 (including 3.6) simply doesn't include any three-way comparison support for strings. Although the internal implementation of the rich comparison operator __lt__()
, __eq__()
etc. do call memcmp()
(in the C implementation of bytes
- cf. Objects/bytesobject.c
) there is no internal three-way comparison function that could be leveraged.
Thus, writing a C extension that provides a three-way comparison function by calling memcmp()
is the next best thing:
#include <Python.h>
static PyObject* cmp(PyObject* self, PyObject* args) {
PyObject *a = 0, *b = 0;
if (!PyArg_UnpackTuple(args, "cmp", 2, 2, &a, &b))
return 0;
if (!PyBytes_Check(a) || !PyBytes_Check(b)) {
PyErr_SetString(PyExc_TypeError, "only bytes() strings supported");
return 0;
}
Py_ssize_t n = PyBytes_GET_SIZE(a), m = PyBytes_GET_SIZE(b);
char *s = PyBytes_AsString(a), *t = PyBytes_AsString(b);
int r = 0;
if (n == m) {
r = memcmp(s, t, n);
} else if (n < m) {
r = memcmp(s, t, n);
if (!r)
r = -1;
} else {
r = memcmp(s, t, m);
if (!r)
r = 1;
}
return PyLong_FromLong(r);
}
static PyMethodDef bytes_util_methods[] = {
{ "cmp", cmp, METH_VARARGS, "Three way compare 2 bytes() objects." },
{0,0,0,0} };
static struct PyModuleDef bytes_util_def = {
PyModuleDef_HEAD_INIT, "bytes_util", "Three way comparison for strings.",
-1, bytes_util_methods };
PyMODINIT_FUNC PyInit_bytes_util(void) {
Py_Initialize();
return PyModule_Create(&bytes_util_def);
}
Compile with:
gcc -Wall -O3 -fPIC -shared bytes_util.c -o bytes_util.so -I/usr/include/python3.6m
Test:
>>> import bytes_util
>>> bytes_util.cmp(b'foo', b'barx')
265725
In contrast to calling memcmp
via the ctypes
package, this foreign call has the same overhead as the builtin bytes comparison operators (as they also are implemented as C extension with the standard Python version).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With