Implementing a Support Vector Machine - EFFICIENTLY computing the Gram matrix K

I'm implementing an SVM for MNIST data in Python. For now I'm using cvxopt to solve the QP and get the alphas back.

But my problem is computing the Gram matrix K EFFICIENTLY. I started with only two classes (digits 6 and 0), first with 1k training examples, then with 10k.

To compute the whole 1k x 1k matrix faster, I'm using multiprocessing.Process and giving each process different rows to compute. But it still takes around 2 minutes, and that's with the RBF (Gaussian) kernel. (The 10k one is still running!)
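
Roughly, the row splitting I mean looks like this (a simplified sketch, not my exact code: some_fun and the sigma value are placeholders, and I'm using multiprocessing.Pool here just to keep the sketch short):

import numpy as np
from multiprocessing import Pool

x = np.random.normal(size=(1000, 784))  # stand-in for my MNIST digits
n, sigma = x.shape[0], 5.0              # sigma made up for the sketch

def some_fun(a, b):
    # Gaussian (RBF) kernel between two sample vectors
    d = a - b
    return np.exp(-np.dot(d, d) / (2 * sigma ** 2))

def gram_row(i):
    # each worker process computes whole rows of K
    return np.array([some_fun(x[i], x[j]) for j in xrange(n)])

if __name__ == '__main__':
    pool = Pool()  # one worker per core by default
    K = np.vstack(pool.map(gram_row, xrange(n)))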

If anyone has worked on this, or if any Python lovers can help me out here, that would be great!

PS: In case anyone doesn't know how a Gram matrix is computed, here are the details. It's simple:

for i in range(1000):
    for j in range(1000):
        K[i, j] = some_fun(x[i], x[j])

where some_fun is a dot product or a fancy Gaussian.
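
Concretely, the two some_fun variants I mean could look like this (a sketch; sigma=5.0 is just a placeholder bandwidth):

import numpy as np

def linear_kernel(a, b):
    # plain dot product
    return np.dot(a, b)

def gaussian_kernel(a, b, sigma=5.0):
    # RBF: exp(-|a - b|^2 / (2 sigma^2)); sigma is the bandwidth
    d = a - b
    return np.exp(-np.dot(d, d) / (2 * sigma ** 2))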

I'm using Python 2.7 with numpy, on a MacBook Air with 4 GB RAM and a 128 GB SSD.

[EDIT] If anyone ever comes here: yes, SVMs DO take a long time, and if you are doing multi-class classification you have to compute the Gram matrix again for each classifier, so it's going to take a while. I'd suggest implementing the algorithm, checking it twice, and letting it run overnight. You'll see good results the next day, that's for sure! :)

asked Mar 21 '13 by code muncher

1 Answer

You're using numpy, right? You should get big speedups by using numpy's matrix operations to compute the full matrix at once, rather than doing slow Python loops to find each pairwise evaluation. For example, if we assume that x is a row-instance data matrix (one row per data point, one column per dimension):

import numpy as np

# get a matrix where the (i, j)th element is |x[i] - x[j]|^2
# using the identity (x - y)^T (x - y) = x^T x + y^T y - 2 x^T y
pt_sq_norms = (x ** 2).sum(axis=1)
dists_sq = np.dot(x, x.T)
dists_sq *= -2
dists_sq += pt_sq_norms.reshape(-1, 1)
dists_sq += pt_sq_norms

# turn into an RBF gram matrix, reusing the same buffer in-place
km = dists_sq; del dists_sq
km /= -2 * sigma**2  # sigma is the RBF bandwidth parameter
np.exp(km, km)  # exponentiates in-place
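
As a quick sanity check (reusing km and the numpy import from above): an RBF Gram matrix should have ones on the diagonal and be symmetric up to floating-point error.

assert np.allclose(np.diag(km), 1.0)  # k(x, x) = exp(0) = 1
assert np.allclose(km, km.T)          # kernel matrices are symmetric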

Generating data with np.random.normal(size=(1000, 784)), this takes 70 ms on my quad-core i5 iMac. Upping it to 10k data points, it takes just under 7 seconds.

sklearn.metrics.pairwise.rbf_kernel works similarly, though it has some extra input checking and support for sparse matrices and such.
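
For reference, a minimal equivalent with sklearn (note that rbf_kernel parameterizes the kernel as exp(-gamma |x - y|^2), so gamma = 1 / (2 * sigma**2) matches the code above; sigma = 5.0 is just an example value):

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.random.normal(size=(1000, 784))
sigma = 5.0  # example bandwidth
km = rbf_kernel(x, gamma=1.0 / (2 * sigma ** 2))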

It's also worth noting that in Python 2 you should be looping over xrange(1000), not range(1000). range actually constructs a list object to loop over, which takes some time and, maybe more importantly, memory. At 10,000 iterations you're probably okay, but this can cause serious problems if your loops get much bigger.
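
To see the difference (a tiny illustration; exact sizes vary by platform):

import sys
sys.getsizeof(range(10 ** 6))   # a full list: several megabytes of pointers
sys.getsizeof(xrange(10 ** 6))  # a small lazy object: ~40 bytes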

answered Nov 01 '22 by Danica