I am trying to run a logit regression on the German credit data (www4.stat.ncsu.edu/~boos/var.select/german.credit.html). To test the code, I have kept only the numerical variables and tried regressing the outcome on them with the following code.
import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np
df = pd.read_csv("germandata.txt",delimiter=' ')
df.columns = ["chk_acc","duration","history","purpose","amount","savings_acc","employ_since","install_rate","pers_status","debtors","residence_since","property","age","other_plans","housing","existing_credit","job","no_people_liab","telephone","foreign_worker","admit"]
# please note that I am only retaining numeric variables
cols_to_keep = ['admit','duration', 'amount', 'install_rate','residence_since','age','existing_credit','no_people_liab']
# rank of cols_to_keep is 8
print np.linalg.matrix_rank(df[cols_to_keep].values)
data = df[cols_to_keep]
data['intercept'] = 1.0
train_cols = data.columns[1:]
#to check the rank of train_cols, which in this case is 8
print np.linalg.matrix_rank(data[train_cols].values)
#fit logit model
logit = sm.Logit(data['admit'], data[train_cols])
result = logit.fit()
All 8 columns appear to be independent when I check the data. In spite of this I am getting a Singular Matrix error. Can you please help?
Thanks
A singular matrix is one that is not invertible. This means that the system of equations you are trying to solve does not have a unique solution, and numpy.linalg raises a LinAlgError ("Singular matrix") when asked to invert one.
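To see the error in isolation, here is a toy illustration (not taken from the question) of numpy refusing to invert a rank-deficient matrix; presumably this is the same LinAlgError that statsmodels' Newton step surfaces:

import numpy as np

# Toy example: the second row is a multiple of the first, so the matrix
# has rank 1 and cannot be inverted.
A = np.array([[1.0, 2.0],
              [2.0, 4.0]])
try:
    np.linalg.inv(A)
except np.linalg.LinAlgError as err:
    print(err)  # "Singular matrix"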
The endog (y) variable needs to be zero/one. In this dataset it has the values 1 and 2; subtracting one produces the expected results.
>>> logit = sm.Logit(data['admit'] - 1, data[train_cols])
>>> result = logit.fit()
>>> print result.summary()
                           Logit Regression Results
==============================================================================
Dep. Variable:                  admit   No. Observations:                  999
Model:                          Logit   Df Residuals:                      991
Method:                           MLE   Df Model:                            7
Date:               Fri, 19 Sep 2014   Pseudo R-squ.:                 0.05146
Time:                        10:06:06   Log-Likelihood:                -579.09
converged:                       True   LL-Null:                       -610.51
                                        LLR p-value:                 4.103e-11
===================================================================================
                       coef    std err          z      P>|z|      [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
duration             0.0261      0.008      3.392      0.001         0.011     0.041
amount            7.062e-05    3.4e-05      2.075      0.038      3.92e-06     0.000
install_rate         0.2039      0.073      2.812      0.005         0.062     0.346
residence_since      0.0411      0.067      0.614      0.539        -0.090     0.172
age                 -0.0213      0.007     -2.997      0.003        -0.035    -0.007
existing_credit     -0.1560      0.130     -1.196      0.232        -0.412     0.100
no_people_liab       0.1264      0.201      0.628      0.530        -0.268     0.521
intercept           -1.5746      0.430     -3.661      0.000        -2.418    -0.732
===================================================================================
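If you want to double-check the coding of the target before fitting, here is a minimal sketch (assuming df, data and train_cols are defined as in the question; the admit01 column name is just illustrative):

# Confirm the target is coded 1/2 rather than 0/1, then recode it explicitly.
print(df['admit'].value_counts())                     # expect the values 1 and 2
data['admit01'] = (data['admit'] == 2).astype(int)    # maps 1 -> 0 and 2 -> 1
logit = sm.Logit(data['admit01'], data[train_cols])
result = logit.fit()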
However, in other cases it is possible that the Hessian is not positive definite when we evaluate it far away from the optimum, for example at bad starting values. Switching to an optimizer that does not use the Hessian often succeeds in those cases. For example, scipy's 'bfgs' is a good optimizer that works in many cases:
result = logit.fit(method='bfgs')
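A slightly fuller sketch of the same idea (assuming data and train_cols from the question; the start_params and maxiter values are illustrative, not required):

# Fit with an optimizer that does not rely on inverting the Hessian.
logit = sm.Logit(data['admit'] - 1, data[train_cols])
result = logit.fit(method='bfgs',
                   start_params=np.zeros(len(train_cols)),  # illustrative starting values
                   maxiter=100)                             # illustrative iteration cap
print(result.summary())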
I've managed to solve this by also removing low-variance columns:
from sklearn.feature_selection import VarianceThreshold

def variance_threshold_selector(data, threshold=0.5):
    # https://stackoverflow.com/a/39813304/1956309
    selector = VarianceThreshold(threshold)
    selector.fit(data)
    return data[data.columns[selector.get_support(indices=True)]]

# min_variance = .9 * (1 - .9)  # You can play here with different values.
min_variance = 0.0001
low_variance = variance_threshold_selector(df, min_variance)
print('columns removed:')
print(df.columns.difference(low_variance.columns))
print(df.shape)            # shape before dropping low-variance columns
print(low_variance.shape)  # shape after dropping them
X = low_variance
# (Logit(y_train, X), logit.fit()... etc)
To give a bit more context: I did one-hot encoding of some categorical data prior to this step, and some of the resulting columns had very few 1's.
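For context, here is a minimal sketch (with made-up values, not the German credit data) of how one-hot encoding a rare category produces a near-constant dummy column that VarianceThreshold would drop:

# Illustrative only: a category that occurs in 2 of 100 rows becomes a dummy
# column that is almost all zeros, i.e. a column with very low variance.
import pandas as pd
toy = pd.DataFrame({'purpose': ['car'] * 98 + ['education'] * 2})
dummies = pd.get_dummies(toy['purpose'], dtype=float)
print(dummies.var())   # the 'education' dummy has variance close to 0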