I'm implementing an SVM for the MNIST data in Python. For now I'm using cvxopt to solve the QP and get the alphas back.
But my problem is computing the Gram matrix **EFFICIENTLY**. I started with only two classes (digits 6 and 0), with the first 1k training examples, then 10k.
To compute the whole 1k x 1k matrix faster, I'm using Process and giving each process different rows to compute. But it still takes about 2 minutes with an RBF (Gaussian) kernel. (The 10k one is still running!)
If anyone has worked on this, or any Python lovers can help me here, that would be great!
PS: If someone doesn't know how to compute the Gram matrix, here are the details. It's simple:
for i in range(n):
    for j in range(n):
        K[i, j] = some_fun(x[i], x[j])
where some_fun is a dot product or a fancy Gaussian.
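For reference, a minimal self-contained sketch of that naive double loop with a Gaussian (RBF) kernel might look like this (the `sigma` bandwidth and the small `n` are just assumptions for illustration):

```python
import numpy as np

def rbf(a, b, sigma=5.0):
    # Gaussian kernel: exp(-|a - b|^2 / (2 * sigma^2))
    diff = a - b
    return np.exp(-np.dot(diff, diff) / (2 * sigma ** 2))

n = 100  # small n just for illustration; the question uses 1k and 10k
x = np.random.normal(size=(n, 784))

K = np.empty((n, n))
for i in range(n):
    for j in range(n):
        K[i, j] = rbf(x[i], x[j])
```

This is the slow O(n^2) pure-Python approach the answer below replaces with vectorized numpy operations.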
I'm using Python 2.7, numpy, and a Mac Air with 4GB RAM and a 128GB SSD.
[EDIT] If anyone ever comes here: yes, SVM training DOES take a long time, and if you are doing multi-class classification you have to compute the Gram matrix again, so it's going to take a while. I would suggest implementing the algorithm, checking it twice, and letting it run overnight. You'll see good results the next day, that's for sure! :)
You're using numpy, right? You should get big speedups by using numpy's matrix operations to compute the full matrix at once, rather than doing slow Python loops for each pairwise evaluation. For example, if we assume that x is a row-instance data matrix (one row per data point, one column per dimension):
import numpy as np

# get a matrix where the (i, j)th element is |x[i] - x[j]|^2
# using the identity (x - y)^T (x - y) = x^T x + y^T y - 2 x^T y
pt_sq_norms = (x ** 2).sum(axis=1)
dists_sq = np.dot(x, x.T)
dists_sq *= -2
dists_sq += pt_sq_norms.reshape(-1, 1)
dists_sq += pt_sq_norms
# turn into an RBF gram matrix
km = dists_sq; del dists_sq
km /= -2 * sigma**2
np.exp(km, km) # exponentiates in-place
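Wrapped into a function (with `sigma` as an assumed kernel-bandwidth parameter) and sanity-checked against the naive double loop on a small input:

```python
import numpy as np

def rbf_gram(x, sigma=1.0):
    # vectorized |x[i] - x[j]|^2 via the expansion
    # (x - y)^T (x - y) = x^T x + y^T y - 2 x^T y
    pt_sq_norms = (x ** 2).sum(axis=1)
    km = np.dot(x, x.T)
    km *= -2
    km += pt_sq_norms.reshape(-1, 1)
    km += pt_sq_norms
    # turn squared distances into an RBF Gram matrix, in place
    km /= -2 * sigma ** 2
    np.exp(km, km)
    return km

# sanity check against the slow loop version (sigma = 1)
x = np.random.normal(size=(20, 5))
slow = np.array([[np.exp(-np.sum((a - b) ** 2) / 2.0) for b in x]
                 for a in x])
assert np.allclose(rbf_gram(x, sigma=1.0), slow)
```

The in-place operations (`*=`, `+=`, `np.exp(km, km)`) avoid allocating extra 1k x 1k (or 10k x 10k) temporaries, which matters with 4GB of RAM.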
Generating data with np.random.normal(size=(1000, 784)), this takes 70ms on my quad-core i5 iMac. Upping it to 10k data points, it takes just under 7 seconds.
sklearn.metrics.pairwise.rbf_kernel works similarly, though it has some extra input checking and support for sparse matrices and such.
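If you already depend on scikit-learn, here is a quick sketch of using it. Note that `rbf_kernel` is parameterized by `gamma` rather than `sigma`; for the K(x, y) = exp(-|x - y|^2 / (2 * sigma^2)) convention used above, `gamma = 1 / (2 * sigma**2)`:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.random.normal(size=(50, 10))
sigma = 2.0

# sklearn computes exp(-gamma * |x - y|^2)
km = rbf_kernel(x, gamma=1.0 / (2 * sigma ** 2))

# same thing by hand via broadcasting (fine for small n, memory-hungry for large n)
d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
manual = np.exp(-d2 / (2 * sigma ** 2))
assert np.allclose(km, manual)
```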
It's also worth noting that in Python 2 you should be looping over xrange(1000), not range(1000). range will actually construct a list object to loop over, which takes some time and, maybe more importantly, memory. For 10,000 you're probably okay, but this can cause serious problems if your loops get too big.