Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Implementing a "Kurtosis filter" using scipys generic_filter

I have a 5000*5000 numpy array on which I want to calculate the Kurtosis for windows of size 25. I tried putting scipys own kurtosis function in the generic_filter found in ndimage.filters like so:

import numpy as np

from scipy.stats import kurtosis
from scipy.ndimage.filters import generic_filter

mat = np.random.random_sample((5000, 5000))

kurtosis_filter = generic_filter(mat, kurtosis, size=25, mode='reflect') 

This never ends and I'm not sure at all of it gives the correct answer. So my first question is if this is a correct way to use the generic_filter with a scipy function. If it happened to be correct, then it is too slow for it to be of any use to me. So my next question would be if there's a faster way to achieve this? For example, thinking about a standard deviation you can simply do something like:

usual_mean = uniform_filter(mat, size=25, mode='reflect')
mean_of_squared = uniform_filter(np.multiply(mat,mat), size=25, mode='reflect')
standard_deviation = (mean_of_squared - np.multiply(usual_mean,usual_mean))**.5

This is blazing fast and simply comes from the fact that $\sigma^2 = E[(X -\mu)^2] = E[X^2] - (E[X])^2$.

like image 822
JEquihua Avatar asked Jun 18 '14 22:06

JEquihua


1 Answers

Your approach is correct, but as you note, it is way too slow for the task at hand. Consider how large your task is in the numerically best implementation (not bothering about boundary values):

def kurt(X, w):
    n, m = X.shape
    K = np.zeros_like(X)

    for i in xrange(w, n-w):                       # 5000 iterations
        for j in xrange(w, m-w):                   # 5000 iterations
            x = X[i-w:i+w+1,j-w:j+w+1].flatten()   # copy 25*25=625 values
            x -= x.mean()                          # calculate and subtract mean
            x /= np.sqrt((x**2).mean())            # normalize by stddev (625 mult.)
            K[i,j] = (x**4).mean() - 3.            # 2*625 = 1250 multiplications
    return K

So we have 5000*5000*1875 ~ 47 billion (!) multiplications. This will even be too slow to be useful in a plain C implementation, let alone by passing a Python function kurtosis() to the inner loop of generic_filter(). The latter is actually calling a C extension function, but there are negligible benefits since it must call back into Python at each iteration, which is very expensive.

So, the actual problem is that you need a better algorithm. Since scipy doesn't have it, let's develop it step by step here.

The key observation that permits acceleration of this problem is that the kurtosis calculations for successive windows are based on mostly the same values, except one row (25 values) which is replaced. So, instead of recalculating the kurtosis from scratch using all 625 values, we attempt to keep track of previously calculated sums and update them such that only the 25 new values need to be processed.

This requires expanding the (x - mu)**4 factor, since only the running sums over x, x**2, x**3 and x**4 can be easily updated. There is no nice cancellation as in the formula for the standard deviation that you mentioned, but it is entirely feasible:

def kurt2(X, w):
    n, m = X.shape
    K = np.zeros_like(X)
    W = 2*w + 1

    for j in xrange(m-W+1):
        for i in xrange(n-W+1):
            x = X[i:i+W,j:j+W].flatten()
            x2 = x*x
            x3 = x2*x
            x4 = x2*x2

            M1 = x.mean()
            M2 = x2.mean()
            M3 = x3.mean()
            M4 = x4.mean()
            M12 = M1*M1
            V = M2 - M12;

            K[w+i,w+j] = (M4 - 4*M1*M3 + 3*M12*(M12 + 2*V)) / (V*V) - 3
    return K

Note: The algorithm written in this form is numerically less stable, since we let numerator and denominator become individually very large, while previously we were dividing early to prevent this (even at the cost of a sqrt). However, I found that for the kurtosis this was never an issue for practical applications.

In the code above, I have tried to minimize the number of multiplications. The running means M1, M2, M3 and M4 can now be updated rather easily, by subtracting the contributions of the row that is no longer part of the window and adding the contributions of the new row.

Let's implement this:

def kurt3(X, w):
    n, m = X.shape
    K = np.zeros_like(X)
    W = 2*w + 1
    N = W*W

    Xp = np.zeros((4, W, W), dtype=X.dtype)
    xp = np.zeros((4, W), dtype=X.dtype)

    for j in xrange(m-W+1):
        # reinitialize every time we reach row 0
        Xp[0] = x1 = X[:W,j:j+W]
        Xp[1] = x2 = x1*x1
        Xp[2] = x3 = x2*x1
        Xp[3] = x4 = x2*x2

        s = Xp.sum(axis=2)       # make sure we sum along the fastest index
        S = s.sum(axis=1)        # the running sums
        s = s.T.copy()           # circular buffer of row sums 

        M = S / N
        M12 = M[0]*M[0]
        V = M[1] - M12;

        # kurtosis at row 0
        K[w,w+j] = (M[3] - 4*M[0]*M[2] + 3*M12*(M12 + 2*V)) / (V*V) - 3

        for i in xrange(n-W):
            xp[0] = x1 = X[i+W,j:j+W]   # the next row
            xp[1] = x2 = x1*x1
            xp[2] = x3 = x2*x1
            xp[3] = x4 = x2*x2

            k = i % W                   # index in circular buffer
            S -= s[k]                   # remove cached contribution of old row
            s[k] = xp.sum(axis=1)       # cache new row
            S += s[k]                   # add contributions of new row

            M = S / N
            M12 = M[0]*M[0]
            V = M[1] - M12;

            # kurtosis at row != 0
            K[w+1+i,w+j] = (M[3] - 4*M[0]*M[2] + 3*M12*(M12 + 2*V)) / (V*V) - 3
    return K

Now that we have a good algorithm, we note that the timing results are still rather disappointing. Our problem is now that Python + numpy is the wrong language for such a number crunching job. Let's write a C extension! Here is _kurtosismodule.c:

#include <Python.h>
#include <numpy/arrayobject.h>

static inline void add_line(double *b, double *S, const double *x, size_t W) {
    size_t l;
    double x1, x2;
    b[0] = b[1] = b[2] = b[3] = 0.;
    for (l = 0; l < W; ++l) {
        b[0] += x1 = x[l];
        b[1] += x2 = x1*x1;
        b[2] += x2*x1;
        b[3] += x2*x2;
    }
    S[0] += b[0];
    S[1] += b[1];
    S[2] += b[2];
    S[3] += b[3];
}

static PyObject* py_kurt(PyObject* self, PyObject* args) {
    PyObject *objK, *objX, *objB;
    int w;
    PyArg_ParseTuple(args, "OOOi", &objK, &objX, &objB, &w);
    double *K = PyArray_DATA(objK);
    double *X = PyArray_DATA(objX);
    double *B = PyArray_DATA(objB);

    size_t n = PyArray_DIM(objX, 0);
    size_t m = PyArray_DIM(objX, 1);
    size_t W = 2*w + 1, N = W*W, i, j, k, I, J;

    double *S = B + 4*W;
    double *x, *b, M, M2, V;

    for (j = 0, J = m*w + w; j < m-W+1; ++j, ++J) {
        S[0] = S[1] = S[2] = S[3] = 0.;
        for (k = 0, x = X + j, b = B; k < W; ++k, x += m, b += 4) {
            add_line(b, S, x, W);
        }

        M = S[0] / N;
        M2 = M*M;
        V = S[1] / N - M2;
        K[J] = ((S[3] - 4*M*S[2]) / N + 3*M2*(M2 + 2*V)) / (V*V) - 3;

        for (i = 0, I = J + m; i < n-W; ++i, x += m, I += m) {
            b = B + 4*(i % W);   // row in circular buffer
            S[0] -= b[0];
            S[1] -= b[1];
            S[2] -= b[2];
            S[3] -= b[3];

            add_line(b, S, x, W);

            M = S[0] / N;
            M2 = M*M;
            V = S[1] / N - M2;
            K[I] = ((S[3] - 4*M*S[2]) / N + 3*M2*(M2 + 2*V)) / (V*V) - 3;
        }
    }
    Py_RETURN_NONE;
}


static PyMethodDef methods[] = {
    {"kurt", py_kurt, METH_VARARGS, ""},
    {0}
};


PyMODINIT_FUNC init_kurtosis(void) {
    Py_InitModule("_kurtosis", methods);
    import_array();
}

Build with:

python setup.py build_ext --inplace

where setup.py is:

from distutils.core import setup, Extension
module = Extension('_kurtosis', sources=['_kurtosismodule.c'])
setup(ext_modules=[module])

Note that we don't allocate any memory in the C extension. This way, we don't have to get into any mess with reference counts/garbage collection. We just use an entry point in Python:

import _kurtosis

def kurt4(X, w):
    # add type/size checking if you like
    K = np.zeros(X.shape, np.double)
    scratch = np.zeros(8*(w + 1), np.double)
    _kurtosis.kurt(K, X, scratch, w)
    return K

Finally, let's do the timing:

In [1]: mat = np.random.random_sample((5000, 5000))

In [2]: %timeit K = kurt4(mat, 12)   # 2*12 + 1 = 25
1 loops, best of 3: 5.25 s per loop

A very reasonable performance given the size of the task!

like image 184
Stefan Avatar answered Oct 02 '22 02:10

Stefan