Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to vectorize finding max value in numpy array with if statement?

My Setup: Python 2.7.4.1, Numpy MKL 1.7.1, Windows 7 x64, WinPython

Context:

I tried to implement the Sequential Minimal Optimization algorithm for solving SVM. I use maximal violating pair approach.

The problem:

In working set selection procedure i want to find maximum value of gradient and its index for elements which met some condition, y[i]*alpha[i]<0 or y[i]*alpha[i]

#y - array of -1 and 1
y=np.array([-1,1,1,1,-1,1])
#alpha- array of floats in range [0,C]
alpha=np.array([0.4,0.1,1.33,0,0.9,0])
#grad - array of floats
grad=np.array([-1,-1,-0.2,-0.4,0.4,0.2])


GMaxI=float('-inf')       
GMax_idx=-1        
n=alpha.shape[0] #usually n=100000
C=4
B=[0,0,C]
for i in xrange(0,n):
    yi=y[i]  #-1 or 1
    alpha_i=alpha[i]

    if (yi * alpha_i< B[yi+1]): # B[-1+1]=0 B[1+1]=C
        if( -yi*grad[i]>=GMaxI):
            GMaxI= -yi*grad[i]
            GMax_idx = i

This procedure is called many times (~50000) and profiler shows that this is the bottleneck. It is possible to vectorize this code?

Edit 1: Add some small exemplary data

Edit 2: I have checked solution proposed by hwlau , larsmans and Mr E. Only solutions proposed Mr E is correct. Below sample code with all three answers:

import numpy as np
y=np.array([   -1,  -1,   -1,  -1,   -1,   -1,   -1,   -1])
alpha=np.array([0,  0.9,  0.4, 0.1, 1.33,    0,  0.9,    0])
grad=np.array([-3, -0.5,   -1,  -1, -0.2,   -4, -0.4, -0.3])
C=4
B=np.array([0,0,C])

#hwlau - wrong index and value
filter = (y*alpha < C*0.5*(y+1)).astype('float')
GMax_idx = (filter*(-y*grad)).argmax()
GMax = -y[GMax_idx]*grad[GMax_idx]

print GMax_idx,GMax

#larsmans - wrong index
neg_y_grad = (-y * grad)[y * alpha < B[y + 1]]
GMaxI = np.max(neg_y_grad)
GMax_ind = np.argmax(neg_y_grad)

print GMax_ind,GMaxI

#Mr E - correct result
BY = np.take(B, y+1)
valid_mask = (y * alpha < BY)
values = -y * grad
values[~valid_mask] = np.min(values) - 1.0

GMaxI = values.max()
GMax_idx = values.argmax()
print GMax_idx,GMaxI

Output (GMax_idx, GMaxI)
0 -3.0 
3 -0.2
4 -0.2

Conclusions

After checking all solutions, the fastest one (2x-6x) is solution proposed by @ali_m. However it requires to install some python packages: numba and all its prerequisites.

I have some trouble to use numba with class methods, so I create global functions which are autojited with numba, my solution look something like this:

from numba import autojit  

@autojit
def FindMaxMinGrad(A,B,alpha,grad,y):
    '''
    Finds i,j indices with maximal violatin pair scheme
    A,B - 3 dim arrays, contains bounds A=[-C,0,0], B=[0,0,C]
    alpha - array like, contains alpha coeficients
    grad - array like, gradient
    y - array like, labels
    '''
    GMaxI=-100000
    GMaxJ=-100000

    GMax_idx=-1
    GMin_idx=-1

    for i in range(0,alpha.shape[0]):

        if (y[i] * alpha[i]< B[y[i]+1]):
            if( -y[i]*grad[i]>GMaxI):
                GMaxI= -y[i]*grad[i]
                GMax_idx = i

        if (y[i] * alpha[i]> A[y[i]+1]):
            if( y[i]*grad[i]>GMaxJ):
                GMaxJ= y[i]*grad[i]
                GMin_idx = i

    return (GMaxI,GMaxJ,GMax_idx,GMin_idx)  

class SVM(object):

    def working_set(self,....):
        FindMaxMinGrad(.....)
like image 561
ksopyla Avatar asked Dec 19 '13 12:12

ksopyla


People also ask

What is the way to find the maximum number in the Numpy array?

Python Numpy – Get Maximum Value of Array Given a numpy array, you can find the maximum value of all the elements in the array. To get the maximum value of a Numpy Array, you can use numpy function numpy. max() function.

What does vectorize do in Numpy?

Define a vectorized function which takes a nested sequence of objects or numpy arrays as inputs and returns a single numpy array or a tuple of numpy arrays. The vectorized function evaluates pyfunc over successive tuples of the input arrays like the python map function, except it uses the broadcasting rules of numpy.

Does Numpy vectorize fast?

Again, some have observed vectorize to be faster than normal for loops, but even the NumPy documentation states: “The vectorize function is provided primarily for convenience, not for performance.


1 Answers

You can probably do quite a lot better than plain vectorization if you use numba to JIT-compile your original code that used nested loops.

import numpy as np
from numba import autojit

@autojit
def jit_max_grad(y, alpha, grad, B):
    maxgrad = -inf
    maxind = -1
    for ii in xrange(alpha.shape[0]):
        if (y[ii] * alpha[ii] < B[y[ii] + 1]):
            g = -y[ii] * grad[ii]
            if g >= maxgrad:
                maxgrad = g
                maxind = ii
    return maxind, maxgrad

For comparison, here's Mr E's vectorized version:

def mr_e_max_grad(y, alpha, grad, B):

    BY = np.take(B, y+1)
    valid_mask = (y * alpha < BY)
    values = -y * grad
    values[~valid_mask] = np.min(values) - 1.0
    GMaxI = values.max()
    GMax_idx = values.argmax()
    return GMax_idx, GMaxI

Timing:

y = np.array([   -1,  -1,   -1,  -1,   -1,   -1,   -1,   -1])
alpha = np.array([0,  0.9,  0.4, 0.1, 1.33,    0,  0.9,    0])
grad = np.array([-3, -0.5,   -1,  -1, -0.2,   -4, -0.4, -0.3])
C = 4
B = np.array([0,0,C])

%timeit mr_e_max_grad(y, alpha, grad, B)
# 100000 loops, best of 3: 19.1 µs per loop

%timeit jit_max_grad(y, alpha, grad, B)
# 1000000 loops, best of 3: 1.07 µs per loop

Update: if you want to see what the timings look like on bigger arrays, it's easy to define a function that generates semi-realistic fake data based on your description in the question:

def make_fake(n, C=4):
    y = np.random.choice((-1, 1), n)
    alpha = np.random.rand(n) * C
    grad = np.random.randn(n)
    B = np.array([0,0,C])
    return y, alpha, grad, B

%%timeit y, alpha, grad, B = make_fake(100000, 4)
mr_e_max_grad(y, alpha, grad, B)
# 1000 loops, best of 3: 1.83 ms per loop

%%timeit y, alpha, grad, B = make_fake(100000, 4)
jit_max_grad(y, alpha, grad, B)
# 1000 loops, best of 3: 471 µs per loop
like image 97
ali_m Avatar answered Sep 22 '22 15:09

ali_m