I am working on a logistic regression model and I am having trouble understanding how to apply the model fit from my training set to my testing set. Sorry, I am new to Python and VERY new to statsmodels.
import pandas as pd
import statsmodels.api as sm
from sklearn import cross_validation
independent_vars = phy_train.columns[3:]
X_train, X_test, y_train, y_test = cross_validation.train_test_split(phy_train[independent_vars], phy_train['target'], test_size=0.3, random_state=0)
X_train = pd.DataFrame(X_train)
X_train.columns = independent_vars
X_test = pd.DataFrame(X_test)
X_test.columns = independent_vars
y_train = pd.DataFrame(y_train)
y_train.columns = ['target']
y_test = pd.DataFrame(y_test)
y_test.columns = ['target']
logit = sm.Logit(y_train,X_train[subset],missing='drop')
result = logit.fit()
print result.summary()
y_pred = logit.predict(X_test[subset])
From the last line, I get this error:
y_pred = logit.predict(X_test[subset])
Traceback (most recent call last):
  File "", line 1, in
  File "C:\Users\eMachine\WinPython-64bit-2.7.5.3\python-2.7.5.amd64\lib\site-packages\statsmodels\discrete\discrete_model.py", line 378, in predict
    return self.cdf(np.dot(exog, params))
ValueError: matrices are not aligned
My training and testing data sets have the same number of variables, so I am sure I am misunderstanding what logit.predict() is actually doing.
Both libraries have their uses. Before selecting one over the other, it is best to consider the purpose of the model. A model designed for prediction is best fit using scikit-learn, while statsmodels is best employed for explanatory models.
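To make that contrast concrete, here is a minimal sketch (not from the original post) of fitting the same kind of logistic regression with each library, reusing the X_train, X_test and y_train variables from the question; the scikit-learn version is geared toward producing predictions, while the statsmodels version exposes the coefficient table for explanation:
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm
# scikit-learn: prediction-oriented API
# (note: scikit-learn does not accept NaNs, unlike Logit(..., missing='drop'))
clf = LogisticRegression()
clf.fit(X_train, y_train['target'])
class_labels = clf.predict(X_test)       # hard 0/1 class predictions
# statsmodels: inference-oriented API
model = sm.Logit(y_train, X_train, missing='drop')
res = model.fit()
print(res.summary())                     # coefficients, standard errors, p-values
probabilities = res.predict(X_test)      # predicted probabilities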
sm.add_constant adds a column of ones to the x1 array (data['SAT']). Looking at the head of x after that call, a column of ones appears alongside SAT. This column of ones corresponds to x_0 in the simple linear regression equation: y_hat = b_0 * x_0 + b_1 * x_1.
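A minimal sketch of that behaviour, using made-up SAT values in place of the tutorial's data['SAT']:
import pandas as pd
import statsmodels.api as sm
# hypothetical stand-in for data['SAT']
x1 = pd.Series([1714, 1664, 1760, 1685], name='SAT')
# add_constant prepends a column of ones named 'const';
# it plays the role of x_0 (the intercept) in y_hat = b_0 * x_0 + b_1 * x_1
x = sm.add_constant(x1)
print(x.head())
#    const   SAT
# 0    1.0  1714
# 1    1.0  1664
# 2    1.0  1760
# 3    1.0  1685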
While statsmodels doesn't offer as wide a variety of options, it focuses on the statistical and econometric tools used in statistics software like Stata and R. Its syntax is similar to R's, so for those transitioning to Python, statsmodels is a good choice.
There are two predict methods.
logit in your example is the model instance. The model instance doesn't know about the estimation results. The model's predict has a different signature because it also needs the parameters: logit.predict(params, exog). This is mainly interesting for internal usage.
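For illustration, a short sketch of the two signatures side by side, using the variable names from the question (subset being whatever column list was used to build the model):
# results-level predict: uses the parameters stored in the fitted results
y_pred = result.predict(X_test[subset])
# model-level predict: the same computation, but the parameters must be passed explicitly;
# calling logit.predict(X_test[subset]) passes the test data where params is expected,
# which is what triggers the "matrices are not aligned" error above
y_pred_manual = logit.predict(result.params, X_test[subset])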
What you want is the predict method of the results instance. In your example,
y_pred = result.predict(X_test[subset])
should give the correct results. It uses the estimated parameters to make predictions with your new test data of explanatory variables, X_test.
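One follow-up worth noting (a sketch, not part of the original answer): for a Logit model, result.predict returns predicted probabilities, so to compare against y_test you would typically apply a cutoff, for example:
y_pred = result.predict(X_test[subset])              # probabilities between 0 and 1
y_pred_class = (y_pred > 0.5).astype(int)            # hypothetical 0.5 cutoff for class labels
accuracy = (y_pred_class == y_test['target']).mean()
print(accuracy)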
Calling model.fit() returns an instance of a results class that provides access to additional post-estimation statistics and analysis, and to prediction.
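As a brief illustration (the attributes below are standard statsmodels results attributes; variable names follow the question):
result = logit.fit()          # the fitted results instance
print(result.params)          # estimated coefficients
print(result.bse)             # standard errors
print(result.pvalues)         # p-values
print(result.conf_int())      # confidence intervals
print(result.summary())       # full summary table
y_pred = result.predict(X_test[subset])   # prediction on new explanatory data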