
How to fit a model to my testing set in statsmodels (python)

I am working on a logistic regression model and am having trouble understanding how to apply the model fitted on my training set to my testing set. Sorry, I am new to Python and VERY new to statsmodels.

import pandas as pd
import statsmodels.api as sm
from sklearn import cross_validation

independent_vars = phy_train.columns[3:]
X_train, X_test, y_train, y_test = cross_validation.train_test_split(phy_train[independent_vars], phy_train['target'], test_size=0.3, random_state=0)
X_train = pd.DataFrame(X_train)
X_train.columns = independent_vars
X_test = pd.DataFrame(X_test)
X_test.columns = independent_vars
y_train = pd.DataFrame(y_train)
y_train.columns = ['target']
y_test = pd.DataFrame(y_test)
y_test.columns = ['target']
logit = sm.Logit(y_train,X_train[subset],missing='drop')
result = logit.fit()
print result.summary()

y_pred = logit.predict(X_test[subset])

From the last line, I get this error:

y_pred = logit.predict(X_test[subset])
Traceback (most recent call last):
  File "", line 1, in
  File "C:\Users\eMachine\WinPython-64bit-2.7.5.3\python-2.7.5.amd64\lib\site-packages\statsmodels\discrete\discrete_model.py", line 378, in predict
    return self.cdf(np.dot(exog, params))
ValueError: matrices are not aligned

My training and testing data sets have the same number of variables, so I am sure I am misunderstanding what logit.predict() is actually doing.

statsNoob asked Apr 13 '14 21:04




1 Answer

There are two predict methods.

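logit in your example is the model instance. The model instance doesn't know about the estimation results. The model's predict method has a different signature because it also needs the parameters: logit.predict(params, exog). Calling logit.predict(X_test[subset]) therefore passes the test data in the params position, which is why the dot product fails with "matrices are not aligned". The model-level predict is mainly of interest for internal use.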

What you want is the predict method of the results instance. In your example

y_pred = result.predict(X_test[subset])

should give the correct results. It uses the estimated parameters to compute predictions for the explanatory variables in your new test data, X_test.

Calling model.fit() returns an instance of a results class that provides access to additional post-estimation statistics and analysis, and to prediction.

Josef answered Sep 23 '22 06:09