I am trying to run a logit regression on the German credit data (www4.stat.ncsu.edu/~boos/var.select/german.credit.html). To test the code, I have kept only the numerical variables and tried regressing the outcome on them with the following code.
import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np
df = pd.read_csv("germandata.txt",delimiter=' ')
df.columns = ["chk_acc","duration","history","purpose","amount","savings_acc","employ_since","install_rate","pers_status","debtors","residence_since","property","age","other_plans","housing","existing_credit","job","no_people_liab","telephone","foreign_worker","admit"]
# please note that I am only retaining numeric variables
cols_to_keep = ['admit','duration', 'amount', 'install_rate','residence_since','age','existing_credit','no_people_liab']
# rank of cols_to_keep is 8
print np.linalg.matrix_rank(df[cols_to_keep].values)
data = df[cols_to_keep]
data['intercept'] = 1.0
train_cols = data.columns[1:]
#to check the rank of train_cols, which in this case is 8
print np.linalg.matrix_rank(data[train_cols].values)
#fit logit model
logit = sm.Logit(data['admit'], data[train_cols])
result = logit.fit()
All 8 columns appear to be independent when I check the data. In spite of this I am getting a Singular Matrix error. Can you please help?
Thanks
A singular matrix is one that is not invertible. This means that the system of equations you are trying to solve does not have a unique solution, and numpy.linalg raises a LinAlgError ("Singular matrix") when asked to invert one.
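To see the error in isolation, here is a toy illustration (not taken from the question) of numpy refusing to invert a rank-deficient matrix; presumably this is the same LinAlgError that statsmodels' Newton step surfaces:

import numpy as np

# Toy example: the second row is a multiple of the first, so the matrix
# has rank 1 and cannot be inverted.
A = np.array([[1.0, 2.0],
              [2.0, 4.0]])
try:
    np.linalg.inv(A)
except np.linalg.LinAlgError as err:
    print(err)  # "Singular matrix"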
The endog (y) variable needs to be zero/one. In this dataset it has the values 1 and 2; subtracting one produces the expected results.
>>> logit = sm.Logit(data['admit'] - 1, data[train_cols])
>>> result = logit.fit()
>>> print result.summary()
                           Logit Regression Results
==============================================================================
Dep. Variable:                  admit   No. Observations:                  999
Model:                          Logit   Df Residuals:                      991
Method:                           MLE   Df Model:                            7
Date:               Fri, 19 Sep 2014   Pseudo R-squ.:                 0.05146
Time:                        10:06:06   Log-Likelihood:                -579.09
converged:                       True   LL-Null:                       -610.51
                                        LLR p-value:                 4.103e-11
===================================================================================
                       coef    std err          z      P>|z|      [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
duration             0.0261      0.008      3.392      0.001         0.011     0.041
amount            7.062e-05    3.4e-05      2.075      0.038      3.92e-06     0.000
install_rate         0.2039      0.073      2.812      0.005         0.062     0.346
residence_since      0.0411      0.067      0.614      0.539        -0.090     0.172
age                 -0.0213      0.007     -2.997      0.003        -0.035    -0.007
existing_credit     -0.1560      0.130     -1.196      0.232        -0.412     0.100
no_people_liab       0.1264      0.201      0.628      0.530        -0.268     0.521
intercept           -1.5746      0.430     -3.661      0.000        -2.418    -0.732
===================================================================================
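If you want to double-check the coding of the target before fitting, here is a minimal sketch (assuming df, data and train_cols are defined as in the question; the admit01 column name is just illustrative):

# Confirm the target is coded 1/2 rather than 0/1, then recode it explicitly.
print(df['admit'].value_counts())                     # expect the values 1 and 2
data['admit01'] = (data['admit'] == 2).astype(int)    # maps 1 -> 0 and 2 -> 1
logit = sm.Logit(data['admit01'], data[train_cols])
result = logit.fit()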
However, in other cases it is possible that the Hessian is not positive definite when we evaluate it far away from the optimum, for example at bad starting values. Switching to an optimizer that does not use the Hessian often succeeds in those cases. For example, scipy's 'bfgs' is a good optimizer that works in many cases:
result = logit.fit(method='bfgs')
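A slightly fuller sketch of the same idea (assuming data and train_cols from the question; the start_params and maxiter values are illustrative, not required):

# Fit with an optimizer that does not rely on inverting the Hessian.
logit = sm.Logit(data['admit'] - 1, data[train_cols])
result = logit.fit(method='bfgs',
                   start_params=np.zeros(len(train_cols)),  # illustrative starting values
                   maxiter=100)                             # illustrative iteration cap
print(result.summary())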
I've managed to solve this by also removing low-variance columns:
from sklearn.feature_selection import VarianceThreshold

def variance_threshold_selector(data, threshold=0.5):
    # https://stackoverflow.com/a/39813304/1956309
    selector = VarianceThreshold(threshold)
    selector.fit(data)
    return data[data.columns[selector.get_support(indices=True)]]

# min_variance = .9 * (1 - .9)  # You can play here with different values.
min_variance = 0.0001
low_variance = variance_threshold_selector(df, min_variance)
print('columns removed:')
print(df.columns.difference(low_variance.columns))
print(df.shape)            # shape before dropping low-variance columns
print(low_variance.shape)  # shape after dropping them
X = low_variance
# (Logit(y_train, X), logit.fit()... etc)
To give a bit more context: I did one-hot encoding of some categorical data prior to this step, and some of the resulting columns had very few 1's.
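For context, here is a minimal sketch (with made-up values, not the German credit data) of how one-hot encoding a rare category produces a near-constant dummy column that VarianceThreshold would drop:

# Illustrative only: a category that occurs in 2 of 100 rows becomes a dummy
# column that is almost all zeros, i.e. a column with very low variance.
import pandas as pd
toy = pd.DataFrame({'purpose': ['car'] * 98 + ['education'] * 2})
dummies = pd.get_dummies(toy['purpose'], dtype=float)
print(dummies.var())   # the 'education' dummy has variance close to 0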