I'm working on teaching myself a bit of logistic regression using Python. I'm trying to apply the lessons in the walkthrough here to the small dataset in the Wikipedia entry here.
Something doesn't seem quite right. Wikipedia and Excel Solver (verified using the method in this video) give an intercept of -4.0777 and a coefficient of 1.5046, but the code I built out from the GitHub example outputs -0.924200 and 0.756024, respectively.
The code I'm attempting to use is below. Any glaring mistakes?
import numpy as np
import pandas as pd
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression

X = [0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 1.75, 2.0, 2.25, 2.5, 2.75, 3.0, 3.25,
     3.5, 4.0, 4.25, 4.5, 4.75, 5.0, 5.5]
y = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]

zipped = list(zip(X, y))
df = pd.DataFrame(zipped, columns=['study_hrs', 'p_or_f'])

y, X = dmatrices('p_or_f ~ study_hrs',
                 df, return_type="dataframe")
y = np.ravel(y)

model = LogisticRegression()
model = model.fit(X, y)

print(pd.DataFrame(np.transpose(model.coef_), X.columns))
>>>
0
Intercept -0.924200
study_hrs 0.756024
Just change the model creation line to
model = LogisticRegression(C=100000, fit_intercept=False)
By default, sklearn fits a regularized logistic regression; the parameter C is the inverse of the regularization strength, with a default of C=1 (small C means strong regularization, large C means weak regularization).
This class implements regularized logistic regression using the liblinear library, newton-cg and lbfgs solvers. It can handle both dense and sparse input. Use C-ordered arrays or CSR matrices containing 64-bit floats for optimal performance; any other input format will be converted (and copied).
Thus to obtain their model you should fit
model = LogisticRegression(C=1000000)
which gives
Intercept -2.038853 # this is actually half the intercept
study_hrs 1.504643 # this is correct
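To see the shrinkage directly, here is a minimal, self-contained sketch (same data, but as a plain (n, 1) feature matrix) that sweeps C; the fitted parameters creep toward the unregularized -4.0777 and 1.5046 as C grows:
import numpy as np
from sklearn.linear_model import LogisticRegression

# Same dataset as a plain (n, 1) feature matrix, no extra column of ones
X = np.array([0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 1.75, 2.0, 2.25, 2.5, 2.75,
              3.0, 3.25, 3.5, 4.0, 4.25, 4.5, 4.75, 5.0, 5.5]).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1])

for C in [0.01, 1.0, 100.0, 1e6]:
    m = LogisticRegression(C=C).fit(X, y)
    print(C, m.intercept_[0], m.coef_[0, 0])  # shrinks toward 0 as C shrinks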
Furthermore, part of the problem lies in the way you prepare the data with patsy; see the simplified, correct example:
import numpy as np
from sklearn.linear_model import LogisticRegression

X = [0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 1.75, 2.0, 2.25, 2.5, 2.75, 3.0, 3.25,
     3.5, 4.0, 4.25, 4.5, 4.75, 5.0, 5.5]
y = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]

X = np.array([[x] for x in X])  # shape (n, 1): one feature, no column of ones
y = np.ravel(y)

model = LogisticRegression(C=1000000.)  # huge C: effectively no regularization
model = model.fit(X, y)
print('coef', model.coef_)
print('intercept', model.intercept_)
gives
coef [[ 1.50464059]]
intercept [-4.07769916]
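As a sanity check, this fitted model also reproduces the pass probabilities quoted in the Wikipedia article (roughly 0.26 after 2 hours of study and 0.87 after 4), reusing model and np from the example above:
# P(pass) at 2 and 4 hours of study
print(model.predict_proba(np.array([[2.0], [4.0]]))[:, 1])
# -> approximately [0.26, 0.87], matching the article's table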
What is the problem exactly? When you call dmatrices, by default it adds a column of ones (the bias term) to your input data:
X = [0.5,0.75,1.0,1.25,1.5,1.75,1.75,2.0,2.25,2.5,2.75,3.0,3.25,
3.5,4.0,4.25,4.5,4.75,5.0,5.5]
y = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]
zipped = list(zip(X,y))
df = pd.DataFrame(zipped,columns = ['study_hrs','p_or_f'])
y, X = dmatrices('p_or_f ~ study_hrs',
df, return_type="dataframe")
print(X)
which leads to
Intercept study_hrs
0 1 0.50
1 1 0.75
2 1 1.00
3 1 1.25
4 1 1.50
5 1 1.75
6 1 1.75
7 1 2.00
8 1 2.25
9 1 2.50
10 1 2.75
11 1 3.00
12 1 3.25
13 1 3.50
14 1 4.00
15 1 4.25
16 1 4.50
17 1 4.75
18 1 5.00
19 1 5.50
and this is why the resulting bias is just half of the true one: scikit-learn also added its own bias term, so you now have two biases. Since the solver treats them symmetrically, the optimal solution splits the weight evenly, giving each half of what a single bias would get (for an L2 penalty, w1 = w2 = b/2 minimizes w1² + w2² subject to w1 + w2 = b).
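Whatever the exact split, the two bias terms must together recover the true intercept. A quick check, reusing the patsy X and y from the snippet above:
model = LogisticRegression(C=1000000).fit(X, np.ravel(y))
total_bias = model.intercept_[0] + model.coef_[0][0]  # sklearn's bias + patsy's
print(total_bias)         # ≈ -4.0777: the two parts sum to the true intercept
print(model.coef_[0][1])  # study_hrs ≈ 1.5046, unaffected by the doubled bias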
So what can you do? Keep patsy's design matrix, but tell sklearn not to add a second intercept of its own:
import numpy as np
import pandas as pd
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression

X = [0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 1.75, 2.0, 2.25, 2.5, 2.75, 3.0, 3.25,
     3.5, 4.0, 4.25, 4.5, 4.75, 5.0, 5.5]
y = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]

zipped = list(zip(X, y))
df = pd.DataFrame(zipped, columns=['study_hrs', 'p_or_f'])

y, X = dmatrices('p_or_f ~ study_hrs',
                 df, return_type="dataframe")
y = np.ravel(y)

# patsy's Intercept column is now the only bias term
model = LogisticRegression(C=100000, fit_intercept=False)
model = model.fit(X, y)

print(pd.DataFrame(np.transpose(model.coef_), X.columns))
gives
Intercept -4.077571
study_hrs 1.504597
as desired
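Alternatively, if what you really want is the unpenalized maximum-likelihood fit that Wikipedia and Solver compute, statsmodels gives it directly, with no huge-C workaround. A minimal sketch, reusing the df built above:
import statsmodels.formula.api as smf

result = smf.logit('p_or_f ~ study_hrs', df).fit()
print(result.params)  # Intercept ≈ -4.0777, study_hrs ≈ 1.5046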