
python logistic regression (beginner)

I'm working on teaching myself a bit of logistic regression using Python. I'm trying to apply the lessons in the walkthrough here to the small dataset in the Wikipedia entry here.

Something doesn't seem quite right. Wikipedia and Excel Solver (verified using the method in this video) give intercept -4.0777 and coefficient 1.5046, but the code I built out from the GitHub example outputs -0.924200 and 0.756024, respectively.

The code I'm attempting to use is below. Any glaring mistakes?

import numpy as np
import pandas as pd
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression


X = [0.5,0.75,1.0,1.25,1.5,1.75,1.75,2.0,2.25,2.5,2.75,3.0,3.25,
3.5,4.0,4.25,4.5,4.75,5.0,5.5]
y = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]

zipped = list(zip(X,y))
df = pd.DataFrame(zipped,columns = ['study_hrs','p_or_f'])

y, X = dmatrices('p_or_f ~ study_hrs',
                  df, return_type="dataframe")

y = np.ravel(y)

model = LogisticRegression()
model = model.fit(X,y)
print(pd.DataFrame(np.transpose(model.coef_),X.columns))

>>>
                  0
Intercept -0.924200
study_hrs  0.756024
asked Dec 04 '15 by drew_is_good


1 Answer

Solution

Just change the model creation line to

model = LogisticRegression(C=100000, fit_intercept=False)

Analysis of the problem

By default, sklearn fits a regularized LogisticRegression with C=1 (C is the inverse regularization strength: small C means strong regularization, large C means weak regularization).

This class implements regularized logistic regression using the liblinear library, newton-cg and lbfgs solvers. It can handle both dense and sparse input. Use C-ordered arrays or CSR matrices containing 64-bit floats for optimal performance; any other input format will be converted (and copied).
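To make the effect concrete, here is a minimal sketch (not from the original answer) that fits the same data under a few values of C; as C shrinks, the learned coefficient is pulled toward zero:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[x] for x in [0.5,0.75,1.0,1.25,1.5,1.75,1.75,2.0,2.25,2.5,
                            2.75,3.0,3.25,3.5,4.0,4.25,4.5,4.75,5.0,5.5]])
y = np.array([0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1])

for C in [1e6, 1.0, 0.01]:
    m = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    print('C=%g  coef=%.4f  intercept=%.4f'
          % (C, m.coef_[0][0], m.intercept_[0]))
# the smaller C is, the harder the coefficient is shrunk toward zero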

Thus to obtain their model you should fit

model = LogisticRegression(C=1000000)

which gives

Intercept -2.038853 # this is actually half the intercept
study_hrs  1.504643 # this is correct

Furthermore, the problem also lies in the way you work with the data in patsy; see this simplified, correct example:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = [0.5,0.75,1.0,1.25,1.5,1.75,1.75,2.0,2.25,2.5,2.75,3.0,3.25,
3.5,4.0,4.25,4.5,4.75,5.0,5.5]
y = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]

X = np.array([[x] for x in X])  # reshape into an (n_samples, 1) column
y = np.ravel(y)                 # 1-D target array

model = LogisticRegression(C=1000000.)
model = model.fit(X,y)

print('coef', model.coef_)
print('intercept', model.intercept_)

gives

coef [[ 1.50464059]]
intercept [-4.07769916]
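i.e. the -4.0777 intercept and 1.5046 coefficient you expected from Wikipedia and Excel Solver.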

What is the problem exactly? When you call dmatrices, it by default augments your input data with a column of ones (the bias/intercept term):

X = [0.5,0.75,1.0,1.25,1.5,1.75,1.75,2.0,2.25,2.5,2.75,3.0,3.25,
3.5,4.0,4.25,4.5,4.75,5.0,5.5]
y = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]

zipped = list(zip(X,y))
df = pd.DataFrame(zipped,columns = ['study_hrs','p_or_f'])

y, X = dmatrices('p_or_f ~ study_hrs',
                  df, return_type="dataframe")

print(X)

which leads to

    Intercept  study_hrs
0           1       0.50
1           1       0.75
2           1       1.00
3           1       1.25
4           1       1.50
5           1       1.75
6           1       1.75
7           1       2.00
8           1       2.25
9           1       2.50
10          1       2.75
11          1       3.00
12          1       3.25
13          1       3.50
14          1       4.00
15          1       4.25
16          1       4.50
17          1       4.75
18          1       5.00
19          1       5.50

This is why the resulting bias is just half of the true one: scikit-learn also adds its own column of ones, so you now have two biases, and the optimal solution is to give each of them half of the weight that would be given to a single one.
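You can verify the halving directly. A small sketch, assuming solver='liblinear' (the default in 2015-era sklearn), which penalizes its own intercept just like patsy's column and thus makes the split exactly even:

import numpy as np
import pandas as pd
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression

X = [0.5,0.75,1.0,1.25,1.5,1.75,1.75,2.0,2.25,2.5,2.75,3.0,3.25,
3.5,4.0,4.25,4.5,4.75,5.0,5.5]
y = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]

df = pd.DataFrame(list(zip(X,y)), columns=['study_hrs','p_or_f'])
y, X = dmatrices('p_or_f ~ study_hrs', df, return_type="dataframe")

model = LogisticRegression(C=1000000, solver='liblinear')
model = model.fit(X, np.ravel(y))

half = model.coef_[0][0]           # weight on patsy's Intercept column
print(half)                        # ~ -2.0389
print(model.intercept_[0])         # ~ -2.0389
print(half + model.intercept_[0])  # ~ -4.0777, the true intercept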

So what can you do?

  • do not use patsy this way
  • forbid patsy from adding a bias (see the sketch just below)
  • tell sklearn not to add its own bias (as in the full example at the end)
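For the patsy route, appending "- 1" to the formula (standard patsy formula syntax) drops the Intercept column, so sklearn's own bias is the only one left; a sketch:

import numpy as np
import pandas as pd
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression

X = [0.5,0.75,1.0,1.25,1.5,1.75,1.75,2.0,2.25,2.5,2.75,3.0,3.25,
3.5,4.0,4.25,4.5,4.75,5.0,5.5]
y = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]

df = pd.DataFrame(list(zip(X,y)), columns=['study_hrs','p_or_f'])

# '- 1' removes the intercept term from the design matrix
y, X = dmatrices('p_or_f ~ study_hrs - 1', df, return_type="dataframe")

model = LogisticRegression(C=1000000).fit(X, np.ravel(y))
print('coef', model.coef_)            # ~ [[ 1.5046]]
print('intercept', model.intercept_)  # ~ [-4.0777]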

The full example below takes the third route: keep patsy's intercept column, but tell sklearn not to add another one.

import numpy as np
import pandas as pd
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression

X = [0.5,0.75,1.0,1.25,1.5,1.75,1.75,2.0,2.25,2.5,2.75,3.0,3.25,
3.5,4.0,4.25,4.5,4.75,5.0,5.5]
y = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]

zipped = list(zip(X,y))
df = pd.DataFrame(zipped,columns = ['study_hrs','p_or_f'])

y, X = dmatrices('p_or_f ~ study_hrs',
                  df, return_type="dataframe")

y = np.ravel(y)

model = LogisticRegression(C=100000, fit_intercept=False)
model = model.fit(X,y)
print(pd.DataFrame(np.transpose(model.coef_),X.columns))

gives

Intercept -4.077571
study_hrs  1.504597

as desired
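Continuing from the example above, a quick sanity check against the worked example in the Wikipedia entry (which quotes roughly a 0.26 pass probability after 2 hours of study); note the leading 1.0 for patsy's Intercept column, since the model was fit with fit_intercept=False:

# probability of passing after 2 hours of study
print(model.predict_proba(np.array([[1.0, 2.0]]))[0][1])  # ~ 0.26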

answered Sep 30 '22 by lejlot