
How to print the estimated coefficients after fitting a model with GridSearchCV? (SGDRegressor)

I am new to scikit-learn, and it does what I was hoping for. The only remaining issue, maddeningly, is that I can't find how to print (or, even better, write to a small text file) all the coefficients it estimated and all the features it selected. How can I do this?

The same question applies to SGDClassifier, and I think the answer is the same for any base estimator that can be fit, with or without cross-validation. Full script below.

import scipy as sp
import numpy as np
import pandas as pd
import multiprocessing as mp
from sklearn import grid_search
from sklearn import cross_validation
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier


def main():
    print("Started.")
    # n = 10**6
    # notreatadapter = iopro.text_adapter('S:/data/controls/notreat.csv', parser='csv')
    # X = notreatadapter[1:][0:n]
    # y = notreatadapter[0][0:n]
    notreatdata = pd.read_stata('S:/data/controls/notreat.dta')
    notreatdata = notreatdata.iloc[:10000,:]
    X = notreatdata.iloc[:,1:]
    y = notreatdata.iloc[:,0]
    n = y.shape[0]

    print("Data loaded.")
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.4, random_state=0)

    print("Data split.")
    scaler = StandardScaler()
    scaler.fit(X_train)  # Don't cheat - fit only on training data
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)  # apply same transformation to test data

    print("Data scaled.")
    # build a model
    model = SGDClassifier(penalty='elasticnet', n_iter=int(np.ceil(10**6 / n)), shuffle=True)  # n_iter must be an int; np.ceil returns a float
    #model.fit(X,y)

    print("CV starts.")
    # run grid search
    param_grid = [{'alpha' : 10.0**-np.arange(1,7),'l1_ratio':[.05, .15, .5, .7, .9, .95, .99, 1]}]
    gs = grid_search.GridSearchCV(model,param_grid,n_jobs=8,verbose=1)
    gs.fit(X_train, y_train)

    print("Scores for alphas:")
    print(gs.grid_scores_)
    print("Best estimator:")
    print(gs.best_estimator_)
    print("Best score:")
    print(gs.best_score_)
    print("Best parameters:")
    print(gs.best_params_)


if __name__=='__main__':
    mp.freeze_support()
    main()

László asked Jun 23 '14 23:06


3 Answers

The SGDClassifier instance fitted with the best hyperparameters is stored in gs.best_estimator_. The coef_ and intercept_ are the fitted parameters of that best model.
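
A minimal sketch of reading those attributes and saving them to a text file. It assumes a small synthetic dataset from make_classification in place of the asker's data, the newer sklearn.model_selection import path and max_iter parameter, and a hypothetical output file name coefficients.txt:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real data (an assumption, not the asker's dataset)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

model = SGDClassifier(penalty="elasticnet", max_iter=1000, tol=1e-3, random_state=0)
param_grid = {"alpha": [1e-4, 1e-3, 1e-2], "l1_ratio": [0.15, 0.5, 0.9]}
gs = GridSearchCV(model, param_grid, cv=3)
gs.fit(X, y)

# With the default refit=True, best_estimator_ is refit on all of X, y
best = gs.best_estimator_
print(best.coef_)       # shape (1, n_features) for a binary problem
print(best.intercept_)  # shape (1,)

# Write the intercept followed by the coefficients, one value per line
# (the file name is hypothetical)
np.savetxt("coefficients.txt", np.concatenate([best.intercept_, best.coef_.ravel()]))
```

For a multiclass problem coef_ has one row per class, so the flattening step would need to be adapted.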

ogrisel answered Oct 08 '22 22:10


  • From an estimator, you can get the coefficients with the coef_ attribute.
  • From a pipeline, you can get the model step with the named_steps attribute, then get the coefficients with coef_.
  • From a grid search over a pipeline, best_estimator_ gives you the best pipeline; use named_steps on it to get the model step, then read its coef_.

Example:

from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LinearSVC())
])

# from pipe (X and y are your feature matrix and labels):
pipe.fit(X, y)
coefs = pipe.named_steps.model.coef_

# from gridsearch:
gs_svc_model = GridSearchCV(estimator=pipe,
                    param_grid={
                      'model__C': [.01, .1, 10, 100, 1000],
                    },
                    cv=5,
                    n_jobs = -1)
gs_svc_model.fit(X, y)
coefs = gs_svc_model.best_estimator_.named_steps.model.coef_
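
The same access pattern can also answer the "which features were selected" part of the question. The following is only a sketch under assumptions: it swaps in SGDClassifier with an elastic-net penalty (as in the question) instead of LinearSVC, uses synthetic data with made-up column names x0..x5, and writes to a hypothetical file coefficients.csv. Features whose coefficient is exactly zero were dropped by the L1 part of the penalty; whether any are zero depends on the data and the chosen alpha.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with made-up column names (assumption for illustration)
X_arr, y = make_classification(n_samples=300, n_features=6, n_informative=2,
                               n_redundant=0, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"x{i}" for i in range(6)])

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", SGDClassifier(penalty="elasticnet", max_iter=1000, tol=1e-3,
                            random_state=0)),
])
param_grid = {"model__alpha": [1e-3, 1e-2, 1e-1], "model__l1_ratio": [0.5, 0.9]}
gs = GridSearchCV(pipe, param_grid, cv=3)
gs.fit(X, y)

# Label each coefficient of the best pipeline's model step with its column name
best_model = gs.best_estimator_.named_steps["model"]
coefs = pd.Series(best_model.coef_.ravel(), index=X.columns)

# Features with a nonzero coefficient are the ones the penalty kept
selected = coefs[coefs != 0]
print(selected)

# Save all coefficients, one row per feature (hypothetical file name)
coefs.to_frame("coef").to_csv("coefficients.csv")
```

Because the scaler is part of the pipeline, these coefficients refer to the standardized features, not the raw columns.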

Abdelhak Mahmoudi answered Oct 08 '22 22:10


I think you might be looking for the estimated parameters of the "best" model rather than the hyperparameters determined through grid search. You can plug the best hyperparameters from the grid search ('alpha' and 'l1_ratio' in your case) back into the model ('SGDClassifier' in your case) and train it again. You can then read the parameters from the fitted model object.

The code could be something like this:

model2 = SGDClassifier(penalty='elasticnet', n_iter=int(np.ceil(10**6 / n)), shuffle=True,
                       alpha=gs.best_params_['alpha'], l1_ratio=gs.best_params_['l1_ratio'])
model2.fit(X_train, y_train)  # coef_ only exists after the model has been fit
print(model2.coef_)

Ted answered Oct 08 '22 23:10