Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python / Cython / Numpy optimization of np.nonzero

I have a piece of code that I am trying to optimize. The majority of the code execution time is taken by cdef np.ndarray index = np.argwhere(array==1) Where array is a numpy is a 512x512,512 numpy array of zeros and ones. Any thoughts on speeding this up? Using Python 2.7, Numpy 1.8.1

Sphericity Function

def sphericity(self,array):

    #Pass an mask array (1's are marked, 0's ignored)
    cdef np.ndarray index = np.argwhere(array==1)
    cdef int xSize,ySize,zSize
    xSize,ySize,zSize=array.shape

    cdef int sa,vol,voxelIndex,x,y,z,neighbors,xDiff,yDiff,zDiff,x1,y1,z1
    cdef float onethird,twothirds,sp
    sa=vol=0 #keep running tally of volume and surface area
    #cdef int nonZeroCount = (array != 0).sum() #Replaces np.count_nonzero(array) for speed
    for voxelIndex in range(np.count_nonzero(array)):
    #for voxelIndex in range(nonZeroCount):
        x=index[voxelIndex,0]
        y=index[voxelIndex,1]
        z=index[voxelIndex,2]
        #print x,y,z,array[x,y,z]
        neighbors=0
        vol+=1

        for xDiff in [-1,0,1]:
            for yDiff in [-1,0,1]:
                for zDiff in [-1,0,1]:
                    if abs(xDiff)+abs(yDiff)+abs(zDiff)==1:
                        x1=x+xDiff
                        y1=y+yDiff
                        z1=z+zDiff
                        if x1>=0 and y1>=0 and z1>=0 and x1<xSize and y1<ySize and z1<zSize:
                            #print '-',x1,y1,z1,array[x1,y1,z1]
                            if array[x1,y1,z1]:
                                #print '-',x1,y1,z1,array[x1,y1,z1]
                                neighbors+=1

        #print 'had this many neighbors',neighbors
        sa+=(6-neighbors)

    onethird=float(1)/float(3)
    twothirds=float(2)/float(3)
    sph = ((np.pi**onethird)*((6*vol)**twothirds)) / sa
    #print 'sphericity',sphericity
    return sph

Profiling Test

#Imports
import pstats, cProfile
import numpy as np
import pyximport
pyximport.install(setup_args={"script_args":["--compiler=mingw32"], "include_dirs":np.get_include()}, reload_support=True) #Generate cython version

#Create fake array to calc sphericity
fakeArray=np.zeros((512,512,512))
fakeArray[200:300,200:300,200:300]=1

#Profiling stuff
cProfile.runctx("sphericity(fakeArray)", globals(), locals(), "Profile.prof")
s = pstats.Stats("Profile.prof")
s.strip_dirs().sort_stats("time").print_stats()

Output of profiling

Mon Oct 06 11:49:57 2014    Profile.prof

         12 function calls in 4.373 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    3.045    3.045    4.373    4.373 <string>:1(<module>)
        1    1.025    1.025    1.025    1.025 {method 'nonzero' of 'numpy.ndarray' objects}
        2    0.302    0.151    0.302    0.151 {numpy.core.multiarray.array}
        1    0.001    0.001    1.328    1.328 numeric.py:731(argwhere)
        1    0.000    0.000    0.302    0.302 fromnumeric.py:492(transpose)
        1    0.000    0.000    0.302    0.302 fromnumeric.py:38(_wrapit)
        1    0.000    0.000    0.000    0.000 {method 'transpose' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.302    0.302 numeric.py:392(asarray)
        1    0.000    0.000    0.000    0.000 numeric.py:462(asanyarray)
        1    0.000    0.000    0.000    0.000 {getattr}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
like image 516
Dbricks Avatar asked Oct 06 '14 18:10

Dbricks


2 Answers

Jaime may have given a good answer, but I will comment on improving the Cython code and add a performance comparison.

First, you should use the 'annotate' feature, cython -a filename.pyx, this will generate an HTML file. Load that in a browser and it highlights 'slow' lines with yellow-orange, this indicates where improvements can be made.

Annotate immediately reveals two things which are easily fixed:

Convert idioms to things cython understands

Firstly, these lines are slow:

        for xDiff in [-1,0,1]:
            for yDiff in [-1,0,1]:
                for zDiff in [-1,0,1]:

The reason for this is that Cython does not know how to convert list iteration into clean c code. It needs to be turned into equivalent code which Cython can optimize, namely the 'in range' form:

        for xDiff in range(-1, 2):
            for yDiff in range(-1, 2):
                for zDiff in range(-1, 2):

Type arrays for fast indexing

The next thing is that this line is slow:

                            if array[x1,y1,z1]:

The reason for this is that array has not been given a type. Because of that it uses python level indexing rather than c level indexing. To remedy this you need to give array a type, this can be done in in this way:

def sphericity(np.ndarray[np.uint8_t, ndim=3] array):

That is assuming the type of the array is 'uint8', replace with the appropriate type (Note: Cython does not support 'np.bool' type, hence I use 'uint8')

You can also use a memory view, you can't use numpy functions on a memory view but you can create a view on the array then index the view instead of the array:

    cdef np.uint8_t array_view [:, :, :] = array
    ...
                                    if array_view[x1,y1,z1]:

The memory view will probably be slightly faster, and makes a clear division between the array (python level calls) and the view (c level calls). If you use no numpy functions you can use a memory view no problems.

Rewrite code to avoid multiple passes over array

What is left is that calculating index and nonZeroCount are both slow, this is for various reasons but mainly relates just to the sheer size of the data (essentially, iterating over 512*512*512 elements just takes time!) Generally speaking anything Numpy can do, optimized Cython can do faster (usually 2-10x faster) - numpy just saves you a lot of reinventing the wheel and a lot of typing and lets you think at a higher level (and if you're not also a c programmer you may not be able to optimize cython well enough). But in this case its easy, you can just eliminate index and nonZeroCount and all related code and just do this:

    for x in range(0, xSize):
        for y in range(0, ySize):
            for z in range(0, zSize):
                if array[x,y,z] == 0:
                    continue
                ... 

This is exceedingly fast as c (which clean Cython compiles down to flawlessly) has no trouble performing billions of operations per second. By eliminating the index and nonZeroCount steps you have essentially save two whole iterations over the entire array, which even at maximum speed requires a minimum of about 0.1 seconds each. Even more important is CPU caching, the entire array is 128mb, much larger than a cpu cache, so doing everything in one pass makes better use of the cpu cache (multiple passes wouldn't matter so much if the arrays fit entirely in the CPU cache).

Optimized Version

Here is the complete code for my optimized version:

#cython: boundscheck=False, nonecheck=False, wraparound=False
import numpy as np
cimport numpy as np

def sphericity2(np.uint8_t [:, :, :] array):

    #Pass an mask array (1's are marked, 0's ignored)
    cdef int xSize,ySize,zSize
    xSize=array.shape[0]
    ySize=array.shape[1]
    zSize=array.shape[2]

    cdef int sa,vol,x,y,z,neighbors,xDiff,yDiff,zDiff,x1,y1,z1
    cdef float onethird,twothirds,sp
    sa=vol=0 #keep running tally of volume and surface area

    for x in range(0, xSize):
        for y in range(0, ySize):
            for z in range(0, zSize):
                if array[x,y,z] == 0:
                    continue

                neighbors=0
                vol+=1

                for xDiff in range(-1, 2):
                    for yDiff in range(-1, 2):
                        for zDiff in range(-1, 2):
                            if abs(xDiff)+abs(yDiff)+abs(zDiff)==1:
                                x1=x+xDiff
                                y1=y+yDiff
                                z1=z+zDiff
                                if x1>=0 and y1>=0 and z1>=0 and x1<xSize and y1<ySize and z1<zSize:
                                    #print '-',x1,y1,z1,array[x1,y1,z1]
                                    if array[x1,y1,z1]:
                                        #print '-',x1,y1,z1,array[x1,y1,z1]
                                        neighbors+=1

                #print 'had this many neighbors',neighbors
                sa+=(6-neighbors)

    onethird=float(1)/float(3)
    twothirds=float(2)/float(3)
    sph = ((np.pi**onethird)*((6*vol)**twothirds)) / sa
    #print 'sphericity',sphericity
    return sph

Sphericity execution time comparison:

Original         : 2.123s
Jaime's          : 1.819s
Optimized Cython : 0.136s
@ moarningsun    : 0.090s

In all the Cython solution runs over 15x faster, with unrolled inner loops (see comment) it runs over 23x faster.

like image 52
Blake Walsh Avatar answered Oct 17 '22 05:10

Blake Walsh


You can get most of what your code is trying to do from vanilla numpy, without any need for Cython. The central thing is to get an efficient way of counting neighbors, which can be done and-ing slices of a mask obtained from your input array. Putting it all together, I think the following does the same as your code, but with a lot less repetition:

def sphericity(arr):
    mask = arr != 0
    vol = np.count_nonzero(mask)
    counts = np.zeros_like(arr, dtype=np.intp)
    for dim, size in enumerate(arr.shape):
        slc = (slice(None),) * dim
        axis_mask = (mask[slc + (slice(None, -1),)] &
                     mask[slc + (slice(1, None),)])
        counts[slc + (slice(None, -1),)] += axis_mask
        counts[slc + (slice(1, None),)] += axis_mask
    sa = np.sum(6 - counts[counts != 0])

    return np.pi**(1./3.)*(6*vol)**(2./3.) / sa
like image 36
Jaime Avatar answered Oct 17 '22 06:10

Jaime