I'm a beginner to data analysis in Python and have been having trouble with this particular assignment. I've searched quite widely, but have not been able to identify what's wrong.
I imported a file, set it up as a DataFrame, and cleaned the data. However, when I try to fit my model to the data, I get this error:
Perfect separation detected, results not available
Here is the code:
from scipy import stats
import numpy as np
import pandas as pd
import collections
import matplotlib.pyplot as plt
import statsmodels.api as sm
loansData = pd.read_csv('https://spark-public.s3.amazonaws.com/dataanalysis/loansData.csv')
loansData = loansData.to_csv('loansData_clean.csv', header=True, index=False)
## cleaning the file
loansData['Interest.Rate'] = loansData['Interest.Rate'].map(lambda x: round(float(x.rstrip('%')) / 100, 4))
loanlength = loansData['Loan.Length'].map(lambda x: x.strip('months'))
loansData['FICO.Range'] = loansData['FICO.Range'].map(lambda x: x.split('-'))
loansData['FICO.Range'] = loansData['FICO.Range'].map(lambda x: int(x[0]))
loansData['FICO.Score'] = loansData['FICO.Range']
#add interest rate less than column and populate
## we only care about interest rates less than 12%
loansData['IR_TF'] = pd.Series('', index=loansData.index)
loansData['IR_TF'] = loansData['Interest.Rate'].map(lambda x: True if x < 12 else False)
#create intercept column
loansData['Intercept'] = pd.Series(1.0, index=loansData.index)
# create list of ind var col names
ind_vars = ['FICO.Score', 'Amount.Requested', 'Intercept']
#define logistic regression
logit = sm.Logit(loansData['IR_TF'], loansData[ind_vars])
#fit the model
result = logit.fit()
#get fitted coef
coeff = result.params
print coeff
Any help would be much appreciated!
Thx, A
Complete separation in a logistic regression, sometimes also referred to as perfect prediction, happens when the outcome variable separates a predictor variable completely. For instance, with an outcome variable Y and predictor variables X1 and X2, if Y is 0 whenever X1 <= 3 and 1 whenever X1 > 3, then X1 alone predicts Y perfectly.
This happens when all or nearly all of the values in one of the predictor categories (or a combination of predictors) are associated with only one of the binary outcome values.
Quasi-complete separation occurs when the dependent variable separates an independent variable, or a combination of several independent variables, almost but not entirely. In other words, the levels of a categorical variable or the values of a numeric variable are split by the groups of a discrete outcome variable, apart from a few cases that overlap.
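To make these definitions concrete, here is a minimal sketch that classifies a single numeric predictor against a binary outcome. The function name `separation_kind` and the toy values for `Y` and the `X_*` lists are made up for illustration. This is not how statsmodels detects separation internally; it is just a direct check of the definitions for one predictor.

```python
def separation_kind(x, y):
    """Classify one numeric predictor x against a binary (0/1) outcome y."""
    x0 = [xi for xi, yi in zip(x, y) if yi == 0]
    x1 = [xi for xi, yi in zip(x, y) if yi == 1]
    if not x0 or not x1:
        # Only one outcome class is present, so there is nothing to model.
        return "single class"
    # Try both orientations: y == 0 below y == 1, and the reverse.
    for low, high in ((x0, x1), (x1, x0)):
        if max(low) < min(high):
            # A threshold on x splits the two outcome groups with no overlap.
            return "complete"
        if max(low) == min(high) and min(low) < max(high):
            # The groups only touch at a single boundary value.
            return "quasi-complete"
    return "none"


# Hypothetical toy data, not taken from the loans file.
Y = [0, 0, 0, 0, 1, 1, 1, 1]
X_complete = [1, 2, 3, 3, 5, 6, 10, 11]
X_quasi = [1, 2, 3, 3, 3, 4, 5, 6]
X_overlap = [1, 5, 2, 6, 3, 7, 4, 8]

print(separation_kind(X_complete, Y))        # complete
print(separation_kind(X_quasi, Y))           # quasi-complete
print(separation_kind(X_overlap, Y))         # none
print(separation_kind(X_complete, [1] * 8))  # single class
```

The last call mirrors the degenerate situation where the outcome column holds only one value, which is worth ruling out before looking for subtler separation.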
You get a PerfectSeparationError because your loansData['IR_TF'] only has a single value, True (or 1). You first converted the interest rate from a percentage to a decimal, so every rate is now well below 12 and the test x < 12 is always True. You should define IR_TF as
loansData['IR_TF'] = loansData['Interest.Rate'] < 0.12 #not 12, and you don't need .map
Then your regression will run successfully:
Optimization terminated successfully.
         Current function value: 0.319503
         Iterations 8
FICO.Score           0.087423
Amount.Requested    -0.000174
Intercept          -60.125045
dtype: float64
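To see why the original comparison collapsed to a single class, here is a small sketch in plain Python. The three rate strings are hypothetical sample values, not rows from the loans file; the conversion is the same one used in the question.

```python
# Hypothetical interest-rate strings in the same format as the CSV column.
raw_rates = ["8.90%", "12.12%", "21.98%"]

# Same cleaning step as in the question: strip '%' and convert to a decimal.
rates = [round(float(r.rstrip('%')) / 100, 4) for r in raw_rates]

# After the conversion every rate is a fraction below 1, so comparing
# against 12 is True for every row.
wrong = [r < 12 for r in rates]

# Comparing against 0.12 (that is, 12%) gives both outcomes.
right = [r < 0.12 for r in rates]

print(wrong)  # [True, True, True]
print(right)  # [True, False, False]
```

With `wrong`, the dependent variable has only one class, so the logistic regression has nothing to fit.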
Also, I noticed several places where the code can be made easier to read and possibly faster. In particular, .map with a lambda may not be as fast as vectorized calculations. Here are my suggestions:
from scipy import stats
import numpy as np
import pandas as pd
import collections
import matplotlib.pyplot as plt
import statsmodels.api as sm
loansData = pd.read_csv('https://spark-public.s3.amazonaws.com/dataanalysis/loansData.csv')
## cleaning the file
loansData['Interest.Rate'] = loansData['Interest.Rate'].str.rstrip('%').astype(float).round(2) / 100.0
loanlength = loansData['Loan.Length'].str.strip('months')#.astype(int) --> loanlength not used below
loansData['FICO.Score'] = loansData['FICO.Range'].str.split('-', expand=True)[0].astype(int)
#add interest rate less than column and populate
## we only care about interest rates less than 12%
loansData['IR_TF'] = loansData['Interest.Rate'] < 0.12
#create intercept column
loansData['Intercept'] = 1.0
# create list of ind var col names
ind_vars = ['FICO.Score', 'Amount.Requested', 'Intercept']
#define logistic regression
logit = sm.Logit(loansData['IR_TF'], loansData[ind_vars])
#fit the model
result = logit.fit()
#get fitted coef
coeff = result.params
# print(coeff)
print(result.summary())  # result has more information than the coefficients alone
                           Logit Regression Results
==============================================================================
Dep. Variable:                  IR_TF   No. Observations:                 2500
Model:                          Logit   Df Residuals:                     2497
Method:                           MLE   Df Model:                            2
Date:                Thu, 07 Jan 2016   Pseudo R-squ.:                  0.5243
Time:                        23:15:54   Log-Likelihood:                -798.76
converged:                       True   LL-Null:                       -1679.2
                                        LLR p-value:                     0.000
====================================================================================
                       coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------
FICO.Score           0.0874      0.004     24.779      0.000         0.081     0.094
Amount.Requested    -0.0002    1.1e-05    -15.815      0.000        -0.000    -0.000
Intercept          -60.1250      2.420    -24.840      0.000       -64.869   -55.381
====================================================================================
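As a follow-up to the cleaned-up script, here is a minimal sketch of a sanity check you could run before calling sm.Logit. It applies the same vectorized cleaning steps to a tiny DataFrame and then confirms that the outcome column contains both classes. The three rows and the names `df` and `class_counts` are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Hypothetical rows shaped like the loans data; the values are invented.
df = pd.DataFrame({
    'Interest.Rate': ['8.90%', '12.12%', '21.98%'],
    'FICO.Range': ['735-739', '715-719', '690-694'],
})

# Vectorized cleaning, mirroring the steps in the script above.
df['Interest.Rate'] = df['Interest.Rate'].str.rstrip('%').astype(float) / 100.0
df['FICO.Score'] = df['FICO.Range'].str.split('-', expand=True)[0].astype(int)
df['IR_TF'] = df['Interest.Rate'] < 0.12

# A binary outcome needs both classes present before a logit fit makes sense.
class_counts = df['IR_TF'].value_counts()
print(class_counts)
if class_counts.size < 2:
    raise ValueError("IR_TF has a single class; check the threshold and units")
```

Running the same `value_counts()` check on the original IR_TF column would have shown a single True class right away.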
By the way -- is this P2P lending data?