Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using statsmodel estimations with scikit-learn cross validation, is it possible?

I posted this question to Cross Validated forum and later realized may be this would find appropriate audience in stackoverlfow instead.

I am looking for a way I can use the fit object (result) ontained from python statsmodel to feed into cross_val_score of scikit-learn cross_validation method? The attached link suggests that it may be possible but I have not succeeded.

I am getting the following error

estimator should a be an estimator implementing 'fit' method statsmodels.discrete.discrete_model.BinaryResultsWrapper object at 0x7fa6e801c590 was passed

Refer this link

like image 763
CARTman Avatar asked Dec 08 '16 17:12

CARTman


People also ask

Is statsmodels better than Sklearn?

While scikit-learn is slightly faster than statsmodels for 1,000 or less observations, this difference is not significant per the t-test analysis. Sci-kit learn is significantly faster for datasets with more than 1,000 observations.

What is the difference between Cross_val_score and cross_val_predict?

cross_val_score returns score of test fold where cross_val_predict returns predicted y values for the test fold. For the cross_val_score() , you are using the average of the output, which will be affected by the number of folds because then it may have some folds which may have high error (not fit correctly).


1 Answers

Indeed, you cannot use cross_val_score directly on statsmodels objects, because of different interface: in statsmodels

  • training data is passed directly into the constructor
  • a separate object contains the result of model estimation

However, you can write a simple wrapper to make statsmodels objects look like sklearn estimators:

import statsmodels.api as sm from sklearn.base import BaseEstimator, RegressorMixin  class SMWrapper(BaseEstimator, RegressorMixin):     """ A universal sklearn-style wrapper for statsmodels regressors """     def __init__(self, model_class, fit_intercept=True):         self.model_class = model_class         self.fit_intercept = fit_intercept     def fit(self, X, y):         if self.fit_intercept:             X = sm.add_constant(X)         self.model_ = self.model_class(y, X)         self.results_ = self.model_.fit()         return self     def predict(self, X):         if self.fit_intercept:             X = sm.add_constant(X)         return self.results_.predict(X) 

This class contains correct fit and predict methods, and can be used with sklearn, e.g. cross-validated or included into a pipeline. Like here:

from sklearn.datasets import make_regression from sklearn.model_selection import cross_val_score from sklearn.linear_model import LinearRegression  X, y = make_regression(random_state=1, n_samples=300, noise=100)  print(cross_val_score(SMWrapper(sm.OLS), X, y, scoring='r2')) print(cross_val_score(LinearRegression(), X, y, scoring='r2')) 

You can see that the output of two models is identical, because they are both OLS models, cross-validated in the same way.

[0.28592315 0.37367557 0.47972639] [0.28592315 0.37367557 0.47972639] 
like image 133
David Dale Avatar answered Sep 20 '22 02:09

David Dale