scikit-learn SVM with a lot of samples / mini batch possible?

In the scikit-learn documentation for SVC (http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) I read:

"The fit time complexity is more than quadratic with the number of samples which makes it hard to scale to dataset with more than a couple of 10000 samples."

I currently have 350,000 samples and 4,500 classes, and these numbers will grow further to 1-2 million samples and 10k+ classes.

My problem is that I am running out of memory. Everything works as it should when I use only 200,000 samples with fewer than 1,000 classes.

Is there a built-in way, or some other approach, to use something like mini-batches with SVM? I saw that MiniBatchKMeans exists, but I don't think it works for SVM?

Any input welcome!

asked Nov 22 '16 by domi771

1 Answer

I mentioned this problem in my answer to this question.

You can split your large dataset into batches that an SVM algorithm can safely consume, find the support vectors for each batch separately, and then build the final SVM model on a dataset consisting of all the support vectors found across the batches.
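A minimal sketch of that idea, using toy data from `make_classification` as a stand-in for the real dataset (batch count and kernel choice are illustrative assumptions, not part of the original answer):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Toy data standing in for a large dataset (sizes are illustrative).
X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=10, n_classes=5, random_state=0)

n_batches = 4
sv_X, sv_y = [], []
for X_batch, y_batch in zip(np.array_split(X, n_batches),
                            np.array_split(y, n_batches)):
    # Fit an SVM on this batch only.
    clf = SVC(kernel="rbf").fit(X_batch, y_batch)
    # Keep just this batch's support vectors and their labels.
    sv_X.append(X_batch[clf.support_])
    sv_y.append(y_batch[clf.support_])

# Train the final model on the union of all support vectors,
# which is typically much smaller than the full dataset.
final_clf = SVC(kernel="rbf").fit(np.vstack(sv_X), np.concatenate(sv_y))
```

Note that this is an approximation: a support vector of the full problem is not guaranteed to be a support vector of its batch, so shuffling the data before splitting (so each batch sees all classes) matters in practice.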

Also, if there is no need for kernels in your case, you can use sklearn's SGDClassifier, which implements stochastic gradient descent. With its default hinge loss it fits a linear SVM, and its partial_fit method lets you train on mini-batches out of core.
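A short sketch of mini-batch training with SGDClassifier's `partial_fit` (again on toy data; the batch size is an assumption):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Toy data standing in for a dataset too large to fit at once.
X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=10, n_classes=5, random_state=0)

# partial_fit must be told the full set of classes up front,
# since any single batch may not contain all of them.
classes = np.unique(y)

clf = SGDClassifier(loss="hinge", random_state=0)  # hinge loss = linear SVM
for X_batch, y_batch in zip(np.array_split(X, 10),
                            np.array_split(y, 10)):
    clf.partial_fit(X_batch, y_batch, classes=classes)
```

In a real out-of-core setting each batch would be loaded from disk inside the loop, so only one batch ever resides in memory at a time.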

answered Sep 25 '22 by Sergey Zakharov