I am using the code below for SVM in Python:
from sklearn import datasets
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
iris = datasets.load_iris()
X, y = iris.data, iris.target
clf = OneVsRestClassifier(SVC(kernel='linear', probability=True, class_weight='auto'))
clf.fit(X, y)
proba = clf.predict_proba(X)
But it is taking a huge amount of time.
Actual data dimensions:
train set: (1422392, 29)
test set: (233081, 29)
How can I speed it up (parallelization or some other way)? Please help. I have already tried PCA and downsampling.
I have 6 classes.
Edit: I found SGDClassifier (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html), but I want probability estimates, and it does not seem to provide them for the SVM (hinge) loss.
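(Side note, not from the original question: SGDClassifier does expose predict_proba when it is trained with a probabilistic loss such as 'log' (logistic regression); only the default hinge loss lacks it. A minimal sketch, shown on iris; in recent scikit-learn versions the loss is spelled 'log_loss':
from sklearn.linear_model import SGDClassifier
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

sgd = SGDClassifier(loss='log')  # the 'log' (logistic) loss enables predict_proba
sgd.fit(X, y)
proba = sgd.predict_proba(X)     # per-class probability estimates
)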
Edit:
from sklearn import datasets
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import SGDClassifier
import joblib
import numpy as np
from sklearn import grid_search
import multiprocessing
import math

def new_func(a):  # converts array element x to 1 / (1 + e^(-x))
    a = 1 / (1 + math.exp(-a))
    return a

if __name__ == '__main__':
    iris = datasets.load_iris()
    cores = multiprocessing.cpu_count() - 2
    X, y = iris.data, iris.target  # loading dataset
    C_range = 10.0 ** np.arange(-4, 4)  # C value range
    param_grid = dict(estimator__C=C_range.tolist())
    svr = OneVsRestClassifier(LinearSVC(class_weight='auto'), n_jobs=cores)  # LinearSVC: faster
    # svr = OneVsRestClassifier(SVC(kernel='linear', probability=True,      # SVC: slow
    #                               class_weight='auto'), n_jobs=cores)
    clf = grid_search.GridSearchCV(svr, param_grid, n_jobs=cores, verbose=2)  # grid search
    clf.fit(X, y)  # training svm model
    decisions = clf.decision_function(X)  # outputs decision function values
    # prob = clf.predict_proba(X)  # only SVC outputs probabilities
    print decisions[:5, :]
    vecfunc = np.vectorize(new_func)
    prob = vecfunc(decisions)  # converts decision values to 1 / (1 + e^(-x))
    print prob[:5, :]
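(Side note on the sigmoid step above, not from the original post: np.vectorize just loops in Python, so the same transform can be computed directly with NumPy, e.g.:
import numpy as np

def sigmoid(a):
    # elementwise 1 / (1 + e^(-a)) over the whole array at once
    return 1.0 / (1.0 + np.exp(-a))

# prob = sigmoid(decisions)  # equivalent to vecfunc(decisions) above, but much faster
)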
Edit 2: The answer by user3914041 yields very poor probability estimates.
The most likely explanation is that you're using too many training examples for your SVM implementation. SVMs are based around a kernel function. Most implementations explicitly store this as an N×N matrix of kernel values between the training points to avoid computing entries over and over again.
The standard kernel SVM, solved via its quadratic programming formulation, has O(N^3) time and O(N^2) space complexity, where N is the training-set size.
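To see what that means at the scale in the question (a rough estimate of my own, not from the answer), a dense float64 kernel matrix over all training samples would be far too large to hold in memory:
N = 1422392                 # training-set size from the question
kernel_bytes = N * N * 8    # one float64 entry per pair of samples
print(kernel_bytes / 1e12)  # roughly 16 TB, far beyond typical RAM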
For all test cases, the patched scikit-learn SVM is reported to be at least 65 times faster than the stock implementation. Note that scikit-learn's SVM will never support GPU acceleration.
If you want to stick with SVC as much as possible and train on the full dataset, you can use ensembles of SVCs that are trained on subsets of the data to reduce the number of records per classifier (which apparently has quadratic influence on complexity). Scikit-learn supports that with the BaggingClassifier wrapper. That should give you similar (if not better) accuracy compared to a single classifier, with much less training time. The training of the individual classifiers can also be set to run in parallel using the n_jobs parameter.
Alternatively, I would also consider using a Random Forest classifier: it supports multi-class classification natively, it is fast, and it gives pretty good probability estimates when min_samples_leaf is set appropriately.
I did a quick test on the iris dataset, blown up 100 times, with an ensemble of 10 SVCs, each one trained on 10% of the data. It is more than 10 times faster than a single classifier. These are the numbers I got on my laptop:
Single SVC: 45s
Ensemble SVC: 3s
Random Forest Classifier: 0.5s
See below the code that I used to produce the numbers:
import time
import numpy as np
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn import datasets
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# blow the iris dataset up 100x so the timing differences are visible
iris = datasets.load_iris()
X, y = iris.data, iris.target
X = np.repeat(X, 100, axis=0)
y = np.repeat(y, 100, axis=0)

# baseline: a single SVC trained on all the data
start = time.time()
clf = OneVsRestClassifier(SVC(kernel='linear', probability=True, class_weight='auto'))
clf.fit(X, y)
end = time.time()
print "Single SVC", end - start, clf.score(X,y)
proba = clf.predict_proba(X)

# ensemble of 10 SVCs, each trained on 10% of the data
n_estimators = 10
start = time.time()
clf = OneVsRestClassifier(BaggingClassifier(SVC(kernel='linear', probability=True, class_weight='auto'), max_samples=1.0 / n_estimators, n_estimators=n_estimators))
clf.fit(X, y)
end = time.time()
print "Bagging SVC", end - start, clf.score(X,y)
proba = clf.predict_proba(X)

# random forest for comparison
start = time.time()
clf = RandomForestClassifier(min_samples_leaf=20)
clf.fit(X, y)
end = time.time()
print "Random Forest", end - start, clf.score(X,y)
proba = clf.predict_proba(X)
If you want to make sure that each record is used only once for training in the BaggingClassifier, you can set the bootstrap parameter to False.
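As a sketch of that variant (the same bagged SVC as above, with only the bootstrap flag changed):
from sklearn.ensemble import BaggingClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

n_estimators = 10
# bootstrap=False draws each sub-sample without replacement
clf = OneVsRestClassifier(BaggingClassifier(
    SVC(kernel='linear', probability=True, class_weight='auto'),
    max_samples=1.0 / n_estimators,
    n_estimators=n_estimators,
    bootstrap=False))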
SVM classifiers don't scale so easily. From the docs, about the complexity of sklearn.svm.SVC:
The fit time complexity is more than quadratic with the number of samples which makes it hard to scale to dataset with more than a couple of 10000 samples.
In scikit-learn you have svm.LinearSVC, which can scale better. Apparently it should be able to handle your data.
Alternatively you could just go with another classifier. If you want probability estimates I'd suggest logistic regression. Logistic regression also has the advantage of not needing probability calibration to output 'proper' probabilities.
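A minimal sketch of that alternative (my own illustration, shown on the iris data used elsewhere in this thread):
from sklearn.linear_model import LogisticRegression
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

# logistic regression gives probability estimates directly via predict_proba
clf = LogisticRegression(class_weight='balanced')
clf.fit(X, y)
proba = clf.predict_proba(X)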
Edit:
I did not know about LinearSVC's complexity; I finally found the relevant information in the user guide:
Also note that for the linear case, the algorithm used in LinearSVC by the liblinear implementation is much more efficient than its libsvm-based SVC counterpart and can scale almost linearly to millions of samples and/or features.
To get probabilities out of a LinearSVC, check out this link. It is just a couple of links away from the probability calibration guide I linked above, and it contains a way to estimate probabilities.
Namely:
prob_pos = clf.decision_function(X_test)
prob_pos = (prob_pos - prob_pos.min()) / (prob_pos.max() - prob_pos.min())
Note the estimates will probably be poor without calibration, as illustrated in the link.
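If you do want calibrated probabilities from a LinearSVC, one option (a sketch of my own, not from the original answer) is to wrap it in CalibratedClassifierCV, which fits a calibrator on held-out folds:
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

base = LinearSVC(class_weight='balanced')
clf = CalibratedClassifierCV(base, method='sigmoid', cv=3)
clf.fit(X, y)
proba = clf.predict_proba(X)  # calibrated probability estimates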
You can use the kernel_approximation module to scale up SVMs to a large number of samples.
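A sketch of what that might look like (an approximate RBF feature map via Nystroem feeding a fast linear classifier; the component count and loss are illustrative, and recent scikit-learn versions spell the loss 'log_loss'):
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

# low-rank approximation of the RBF kernel, then a linear model on top
clf = make_pipeline(
    Nystroem(kernel='rbf', n_components=100),
    SGDClassifier(loss='log'))
clf.fit(X, y)
proba = clf.predict_proba(X)  # available because of the 'log' loss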
It was briefly mentioned in the top answer; here is the code. The quickest way to do this is via the n_jobs parameter: replace the line
clf = OneVsRestClassifier(SVC(kernel='linear', probability=True, class_weight='auto'))
with
clf = OneVsRestClassifier(SVC(kernel='linear', probability=True, class_weight='auto'), n_jobs=-1)
This will use all available CPUs on your computer, while still doing the same computation as before.
For large datasets consider using LinearSVC or SGDClassifier instead, possibly after a Nystroem transformer.
https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html