
How to use grid search for the svm?

I find machine learning interesting and am studying the scikit-learn documentation for fun. Below I have done some data cleaning and trained an SVM, and I want to use grid search to find the best values for its parameters.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score


cats = ['sci.space','rec.autos','rec.motorcycles']
newsgroups_train = fetch_20newsgroups(subset='train',remove=('headers', 'footers', 'quotes'), categories = cats)
newsgroups_test = fetch_20newsgroups(subset='test',remove=('headers', 'footers', 'quotes'), categories = cats)

vectorizer = TfidfVectorizer( stop_words = "english")


vectors = vectorizer.fit_transform(newsgroups_train.data)
vectors_test = vectorizer.transform(newsgroups_test.data)

clf =  SVC(C=0.4,gamma=1,kernel='linear')

clf.fit(vectors, newsgroups_train.target)
pred = clf.predict(vectors_test)
print(accuracy_score(newsgroups_test.target, pred))

The accuracy is: 0.849

I have heard that grid search can find the optimal parameter values, but I can't work out how to perform it. Could you please elaborate? This is what I tried, but it is not correct. I would like to learn the correct way, along with some explanation. Thanks.

Cs = np.array([0.001, 0.01, 0.1, 1, 10])
gammas = np.array([0.001, 0.01, 0.1, 1])
model = SVC()
grid = GridSearchCV(estimator=model, param_grid=dict(Cs=alphas,gamma=gammas))
grid.fit(newsgroups_train.data, newsgroups_train.target)
print(grid)
# summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_.alpha)

EDIT based on the answer received:

parameters = {'C': [1, 10], 
          'gamma': [0.001, 0.01, 1]}
model = SVC()
grid = GridSearchCV(estimator=model, param_grid=parameters)
grid.fit(vectors, newsgroups_train.target)
print(grid)
# summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_)

It returns:

GridSearchCV(cv='warn', error_score='raise-deprecating',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'C': [1, 10], 'gamma': [0.001, 0.01, 1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
0.8532212885154061
SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

I need clarification on these points:

1) What exactly is displayed in the results?
2) Does the search treat C as a range from 1 to 10, or does it only try the values 1 and 10?
3) Can you suggest anything to improve the accuracy further?
4) I noticed that TF-IDF made the accuracy worse, even though it cleaned the data of words that don't carry any value.
user11911849 asked Jan 21 '26 02:01


1 Answer

You want to pass a dictionary of parameters in which the keys are the parameter names as defined in the model's documentation (1). Each value should be a list of the values you would like to try for that parameter.

The grid search will then fit and cross-validate the model for every possible combination of those values. There are some good examples in the documentation (2).

  1. https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
  2. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
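To make "every possible combination" concrete, here is a minimal sketch that enumerates the grid with the standard library only. It illustrates the idea and is not scikit-learn's own implementation; the variable names are chosen just for this example.

```python
from itertools import product

# Same grid as used in this answer: each key maps to a list of discrete values.
parameters = {'C': [1, 10], 'gamma': [0.001, 0.01, 1]}

# Build one dict per combination (Cartesian product of the value lists).
names = list(parameters)
combinations = [dict(zip(names, values))
                for values in product(*parameters.values())]

for combo in combinations:
    print(combo)

# 2 values of C times 3 values of gamma gives 6 candidate settings.
print(len(combinations))
```

Only the listed values are tried, so C is either 1 or 10 here, never anything in between. To cover a range you would list more values explicitly.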

For your script, you also want to make sure that you are feeding the grid search the correct training data: in this case the TF-IDF matrix 'vectors', not the raw text in 'newsgroups_train.data'.

See below:

parameters = {'C': [1, 10], 
          'gamma': [0.001, 0.01, 1]}
model = SVC()
grid = GridSearchCV(estimator=model, param_grid=parameters)
grid.fit(vectors, newsgroups_train.target)
print(grid)
# summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_)
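If you want to see what the search actually evaluated, the fitted object exposes more than its printed representation. The following is a minimal sketch, assuming a small synthetic dataset from make_classification as a stand-in for the TF-IDF vectors; the data, cv=5 and the variable names are illustrative choices, not part of the question.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the TF-IDF vectors (assumption, keeps the demo fast).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

parameters = {'C': [1, 10], 'gamma': [0.001, 0.01, 1]}
grid = GridSearchCV(estimator=SVC(), param_grid=parameters, cv=5)
grid.fit(X, y)

# One entry per combination: the parameters tried and their mean CV accuracy.
results = list(zip(grid.cv_results_['params'],
                   grid.cv_results_['mean_test_score']))
for params, score in results:
    print(params, round(score, 3))

print(grid.best_params_)  # combination with the highest mean CV score
print(grid.best_score_)   # that mean cross-validated score
```

Note that best_score_ is the mean cross-validated accuracy on the training data, not the accuracy on your test set. With the default refit=True, best_estimator_ is retrained on all the training data using best_params_, so you can call grid.predict on the test vectors afterwards.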

Please accept the answer if it works. Good luck!

db702 answered Jan 22 '26 16:01


