Speeding up Python code with Cython

I have a function that makes lots of calls to a simply defined hash function and tests to see when it finds a duplicate. I need to run many simulations with it, so I would like it to be as fast as possible, and I am attempting to use Cython for this. The Cython code is currently called with a normal Python list of integers with values in the range 0 to m^2.

import math, random
cdef int a,b,c,d,m,pos,value, cyclelimit, nohashcalls   
def h3(int a,int b,int c,int d, int m,int x):
    return (a*x**2 + b*x+c) %m    
def floyd(inputx):
    dupefound, nohashcalls = (0,0)
    m = len(inputx)
    loops = int(m*math.log(m))
    for loopno in xrange(loops):
        if (dupefound == 1):
            break
        a = random.randrange(m)
        b = random.randrange(m)
        c = random.randrange(m)
        d = random.randrange(m)
        pos = random.randrange(m)
        value = inputx[pos]
        listofpos = [0] * m
        listofpos[pos] = 1
        setofvalues = set([value])
        cyclelimit = int(math.sqrt(m))
        for j in xrange(cyclelimit):
            pos = h3(a,b, c,d, m, inputx[pos])
            nohashcalls += 1    
            if (inputx[pos] in setofvalues):
                if (listofpos[pos]==1):
                    dupefound = 0
                else:
                    dupefound = 1
                    print "Duplicate found at position", pos, " and value", inputx[pos]
                break
            listofpos[pos] = 1
            setofvalues.add(inputx[pos])
    return dupefound, nohashcalls 

How can I convert inputx and listofpos to use C type arrays and to access the arrays at C speed? Are there any other speed ups I can use? Can setofvalues be sped up?

So that there is something to compare against, 50 calls to floyd() with m = 5000 currently take around 30 seconds on my computer.
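For anyone who wants to reproduce that kind of measurement, a minimal timing harness could look like the sketch below. The names `make_input` and `time_calls` are made up for illustration, and `len` at the bottom is only a stand-in for `floyd`. It is written for Python 3, so `range` replaces the `xrange` used in the question.

```python
import random
import time


def make_input(m, seed=None):
    """Return m distinct integers drawn from range(m**2), as in the question."""
    rng = random.Random(seed)
    return rng.sample(range(m ** 2), m)


def time_calls(func, m=5000, calls=50, seed=0):
    """Call func on `calls` fresh inputs of size m; return seconds spent in func."""
    total = 0.0
    for i in range(calls):
        inputx = make_input(m, seed + i)
        start = time.perf_counter()
        func(inputx)
        total += time.perf_counter() - start
    return total


if __name__ == "__main__":
    # `len` is a placeholder; pass edcython.floyd here to time the real function.
    print("%.3f s" % time_calls(len, m=1000, calls=5))
```

Building the input outside the timed region keeps the cost of `random.sample` out of the reported figure.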

Update: Example code snippet to show how floyd is called.

m = 5000
inputx = random.sample(xrange(m**2), m)
(dupefound, nohashcalls) = edcython.floyd(inputx)

First of all, you need to type the variables inside the function itself; declaring them at module level does not help. A good example of it is here.

Second, cython -a, for "annotate", gives you a really excellent breakdown of the code generated by the Cython compiler and a color-coded indication of how dirty (read: Python API heavy) it is. This output is really essential when trying to optimize anything.

Third, the now famous page on working with Numpy explains how to get fast, C-style access to the Numpy array data. Unfortunately it's verbose and annoying. We're in luck however, because more recent versions of Cython provide Typed Memory Views, which are both easy to use and awesome. Read that entire page before you try to do anything else.
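A typed memory view argument such as `int[:]` accepts objects that expose the buffer protocol with a matching item type, so a plain Python list has to be converted before the call. The sketch below uses only the standard library (`array.array`) to do the conversion and inspects the result with Python's built-in `memoryview`. `to_int_buffer` is a hypothetical helper name, and a Numpy array with a matching integer dtype would be the more common choice.

```python
from array import array


def to_int_buffer(values):
    """Pack a list of Python ints into a contiguous buffer of C ints.

    Raises OverflowError if a value does not fit in a C int.
    """
    return array("i", values)


buf = to_int_buffer([3, 1, 4, 1, 5])
view = memoryview(buf)

# The format code "i" is a C int, which is the item type `int[:]` expects.
print(view.format, view.itemsize, view.ndim, view.shape, view.c_contiguous)
```

The conversion is a one-off cost per call, so it is best done once by the caller when the same input is reused.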

After ten minutes or so I came up with this:

# cython: infer_types=True

# Use the C math library to avoid Python overhead.
from libc cimport math
# For boundscheck below.
import cython
# We're lazy so we'll let Numpy handle our array memory management.
import numpy as np
# You would normally also import the Numpy pxd to get faster access to the Numpy
# API, but it requires some fancier compilation options so I'll leave it out for
# this demo.
# cimport numpy as np

import random

# This is a small function that doesn't need to be exposed to Python at all. Use
# `cdef` instead of `def` and inline it.
cdef inline int h3(int a,int b,int c,int d, int m,int x):
    return (a*x**2 + b*x+c) % m

# If we want to live fast and dangerously, we tell cython not to check our array
# indices for IndexErrors. This means we CAN overrun our array and crash the
# program or screw up our stack. Use with caution. Profiling suggests that we
# aren't gaining anything in this case so I leave it on for safety.
# @cython.boundscheck(False)
# `cpdef` so that calling this function from another Cython (or C) function can
# skip the Python function call overhead, while still allowing us to use it from
# Python.
cpdef floyd(int[:] inputx):
    # Type the variables in the scope of the function.
    cdef int a,b,c,d, value, cyclelimit
    cdef unsigned int dupefound = 0
    cdef unsigned int nohashcalls = 0
    cdef unsigned int loopno, pos, j

    # `m` is inferred as a C integer type because inputx is already a Cython
    # memory view and `infer_types` is on.
    m = inputx.shape[0]

    cdef unsigned int loops = int(m*math.log(m))

    # Again using the memory view, but letting Numpy allocate an array of zeros.
    cdef int[:] listofpos = np.zeros(m, dtype=np.int32)

    # Keep this random sampling out of the loop
    cdef int[:, :] randoms = np.random.randint(0, m, (loops, 5)).astype(np.int32)

    for loopno in range(loops):
        if (dupefound == 1):
            break

        # From our precomputed array
        a = randoms[loopno, 0]
        b = randoms[loopno, 1]
        c = randoms[loopno, 2]
        d = randoms[loopno, 3]
        pos = randoms[loopno, 4]

        value = inputx[pos]

        # Unfortunately, Memory View does not support "vectorized" operations
        # like standard Numpy arrays. Otherwise we'd use listofpos *= 0 here.
        for j in range(m):
            listofpos[j] = 0

        listofpos[pos] = 1
        setofvalues = set((value,))
        cyclelimit = int(math.sqrt(m))
        for j in range(cyclelimit):
            pos = h3(a, b, c, d, m, inputx[pos])
            nohashcalls += 1
            if (inputx[pos] in setofvalues):
                if (listofpos[pos]==1):
                    dupefound = 0
                else:
                    dupefound = 1
                    print "Duplicate found at position", pos, " and value", inputx[pos]
                break
            listofpos[pos] = 1
            setofvalues.add(inputx[pos])
    return dupefound, nohashcalls

There are no tricks here that aren't explained on docs.cython.org, which is where I learned them myself, but it helps to see it all come together.

The most important changes to your original code are in the comments, but they all amount to giving Cython hints about how to generate code that doesn't use the Python API.
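One thing typing changes silently is integer width: Python integers never overflow, while a C `int` is typically 32 bits. A quick check, sketched here in plain Python under the question's stated ranges (coefficients below `m`, values below `m**2`), shows how large the intermediate `a*x**2 + b*x + c` can get before the `% m` is applied. `max_intermediate` is a name invented for this illustration.

```python
INT32_MAX = 2 ** 31 - 1
INT64_MAX = 2 ** 63 - 1


def max_intermediate(m):
    """Largest a*x**2 + b*x + c for 0 <= a, b, c < m and 0 <= x < m**2."""
    coeff = m - 1
    x = m * m - 1
    return coeff * x ** 2 + coeff * x + coeff


if __name__ == "__main__":
    worst = max_intermediate(5000)
    print("fits in 32 bits:", worst <= INT32_MAX)
    print("fits in 64 bits:", worst <= INT64_MAX)
```

For m = 5000 the worst case exceeds the 32-bit range but stays within 64 bits, so if exact agreement with the pure Python results matters, doing the arithmetic in `h3` with a 64-bit type such as `long long` is worth considering.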

As an aside: I really don't know why infer_types is not on by default. It lets the compiler implicitly use C types instead of Python types where possible, meaning less work for you.

If you run cython -a on this, you'll see that the only lines that call into Python are your calls to random.sample, and building or adding to a Python set().

On my machine, your original code runs in 2.1 seconds. My version runs in 0.6 seconds.

The next step is to get random.sample out of that loop, but I'll leave that to you.

I have edited my answer to demonstrate how to precompute the random samples. This brings the time down to 0.4 seconds.

