
Using Cython to speed up thousands of set operations

I have been trying to get over my fear of Cython (fear because I know literally nothing about C or C++).

I have a function which takes 2 arguments, a set (we'll call it testSet), and a list of sets (we'll call that targetSets). The function then iterates through targetSets, and computes the length of the intersection with testSet, adding that value to a list, which is then returned.

Now, this isn't that slow by itself, but the problem is that I need to run a large number of simulations of testSet (~10,000), and targetSets is about 10,000 sets long.

So for a small number of simulations to test, the pure python implementation was taking ~50 secs.
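The pure Python version is not shown in the post. As a rough sketch only, assuming it applies the same rule as the Cython code further down (overlaps of size 0 or 1 are recorded as 0), a baseline might look like this. The name compute_overlap_py is hypothetical and not from the original code.

```python
def compute_overlap_py(test_set, target_sets):
    """Return the intersection size of test_set with each set in target_sets.

    Hypothetical pure-Python baseline; overlaps of size 0 or 1 are
    recorded as 0, mirroring the Cython function in the post.
    """
    obs_overlaps = []
    for target in target_sets:
        n = len(test_set & target)  # size of the set intersection
        obs_overlaps.append(n if n > 1 else 0)
    return obs_overlaps


# Small illustrative call with made-up sets
example = compute_overlap_py({1, 2, 3}, [{1, 2}, {3, 9}, {7, 8}, {1, 2, 3, 4}])
```

Most of the time here goes into the set intersection itself, which is already implemented in C inside CPython.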

I tried making a Cython function; it works, and it now runs in ~16 secs.

If anyone can think of anything else I could do to the Cython function, that would be great (Python 2.7, by the way).

Here is my Cython implementation in overlapFunc.pyx

def computeOverlap(set testSet, list targetSets):
    cdef list obsOverlaps  = []
    cdef int i, N
    cdef set overlap
    N = len(targetSets)
    for i in range(N):
        overlap = testSet & targetSets[i]
        if len(overlap) <= 1:
            obsOverlaps.append(0)
        else:
            obsOverlaps.append(len(overlap))
    return obsOverlaps

and the setup.py

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

ext_modules = [Extension("overlapFunc", 
                         ["overlapFunc.pyx"])]

setup(
      name = 'computeOverlap function',
      cmdclass = {'build_ext': build_ext},
      ext_modules = ext_modules
      )

and some code to build random sets for testing and to time the function, in test.py

import numpy as np
from overlapFunc import computeOverlap
import time

def simRandomSet(n):
    for i in range(n):
        simSet= set(np.random.randint(low=1, high=100, size=50))
        yield simSet


if __name__ == '__main__':
    np.random.seed(23032014)
    targetSet = [set(np.random.randint(low=1, high=100, size=50)) for i in range(10000)]

    simulatedTestSets = simRandomSet(200)
    start = time.time()
    for i in simulatedTestSets:
        obsOverlaps = computeOverlap(i, targetSet)
    print time.time()-start

I tried changing the def at the start of the computeOverlap function to cdef, as in:

cdef list computeOverlap(set testSet, list targetSets):

but I get the following warning message when I run the setup.py script:

'__pyx_f_11overlapFunc_computeOverlap' defined but not used [-Wunused-function]

and then, when I run something that tries to use the function, I get an ImportError:

    from overlapFunc import computeOverlap
ImportError: cannot import name computeOverlap

Thanks in advance for your help,

Cheers,

Davy

Davy Kavanagh asked Nov 01 '22 03:11


1 Answer

In the following line of setup.py, the extension module name and the source filename do not match the actual filename (overlapFunc.pyx).

ext_modules = [Extension("computeOverlapWithGeneList", 
                         ["computeOverlapWithGeneList.pyx"])]

Replace it with:

ext_modules = [Extension("overlapFunc",
                         ["overlapFunc.pyx"])]
falsetru answered Nov 08 '22 06:11