Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimising string generate and test

I am trying to run a simulation to test the average Levenshtein distance between random binary strings.

To speed it up I am using this C extension.

My code is as follows.

from Levenshtein import distance 
for i in xrange(20):
    sum = 0
    for j in xrange(1000):
        str1 =  ''.join([random.choice("01") for x in xrange(2**i)])
        str2 =  ''.join([random.choice("01") for x in xrange(2**i)])
        sum += distance(str1,str2)
    print sum/(1000*2**i)

I think the slowest part is now the string generation. Can that be sped up somehow or is there some other speed up I could try?

I also have 8 cores but I don't know how hard it would be take advantage of those.

Unfortunately I can't use pypy because of the C extension.

like image 433
marshall Avatar asked Apr 27 '13 09:04

marshall


2 Answers

The following solution should be way better in terms of runtime.

It generates a number with 2**i random bits (random.getrandbits), converts it to a string of the number's binary representation (bin), takes everything beginning with the 3nd character to the end (because the result of bin is prepended with '0b') and prepends the resulting string with zeros to have the length you want.

str1 = bin(random.getrandbits(2**i))[2:].zfill(2**i)

Quick timing for your maximum string length of 2**20:

from timeit import Timer
>>> t=Timer("''.join(random.choice('01') for x in xrange(2**20))", "import random")
>>> sorted(t.repeat(10,1))
[0.7849910731831642, 0.787418033587528, 0.7894113893237318, 0.789840397476155, 0.7907980049587877, 0.7908638883536696, 0.7911707057912736, 0.7935838766477445, 0.8014726470912592, 0.8228315074311467]
>>> t=Timer("bin(random.getrandbits(2**20))[2:].zfill(2**20)", "import random")
>>> sorted(t.repeat(10,1))
[0.005115922216191393, 0.005215130351643893, 0.005234282501078269, 0.005451850921190271, 0.005531523863737675, 0.005627284612046424, 0.005746794025981217, 0.006217553864416914, 0.014556016781853032, 0.014710766150983545]

That's a speedup of a factor of 150 on average.

like image 141
halex Avatar answered Oct 06 '22 00:10

halex


You can create a Python string using the Python/C API, which will be significantly faster than any method that exclusively uses Python, since Python itself is implemented in Python/C. Performance will likely primarily depend on the efficiency of the random number generator. If you are on a system with a reasonable random(3) implementation, such as the one in glibc, an efficient implementation of random string would look like this:

#include <Python.h>

/* gcc -shared -fpic -O2 -I/usr/include/python2.7 -lpython2.7 rnds.c -o rnds.so */

static PyObject *rnd_string(PyObject *ignore, PyObject *args)
{
    const char choices[] = {'0', '1'};
    PyObject *s;
    char *p, *end;
    int size;
    if (!PyArg_ParseTuple(args, "i", &size))
        return NULL;
    // start with a two-char string to avoid the empty string singleton.
    if (!(s = PyString_FromString("xx")))
        return NULL;
    _PyString_Resize(&s, size);
    if (!s)
        return NULL;
    p = PyString_AS_STRING(s);
    end = p + size;
    for (;;) {
      unsigned long rnd = random();
      int i = 31;   // random() provides 31 bits of randomness
      while (i-- > 0 && p < end) {
        *p++ = choices[rnd & 1];
        rnd >>= 1;
      }
      if (p == end)
        break;
    }
    return s;
}

static PyMethodDef rnds_methods[] = {
    {"rnd_string",  rnd_string, METH_VARARGS },
    {NULL, NULL, 0, NULL}
};

PyMODINIT_FUNC initrnds(void)
{
    Py_InitModule("rnds", rnds_methods);
}

Testing this code with halex's benchmark shows that it is 280x faster than the original code, and 2.3x faster than halex's code (on my machine):

# the above code
>>> t1 = Timer("rnds.rnd_string(2**20)", "import rnds")
>>> sorted(t1.repeat(10,1))
[0.0029861927032470703, 0.0029909610748291016, ...]
# original generator
>>> t2 = Timer("''.join(random.choice('01') for x in xrange(2**20))", "import random")
>>> sorted(t2.repeat(10,1))
[0.8376679420471191, 0.840252161026001, ...]
# halex's generator
>>> t3 = Timer("bin(random.getrandbits(2**20-1))[2:].zfill(2**20-1)", "import random")
>>> sorted(t3.repeat(10,1))
[0.007007122039794922, 0.007027149200439453, ...]

Adding C code to a project is a complication, but for a 280x speedup of a critical operation, it might well be worth it.

For further efficiency improvement, look into faster RNGs, and invoke call them from separate threads, in order to parallelize random number generation is parallelized. The latter would benefit from a lock-free synchronization mechanism to make sure that inter-thread communication doesn't bog down the otherwise fast generation process.

like image 23
user4815162342 Avatar answered Oct 06 '22 01:10

user4815162342