I'm going through this tutorial on odds ratios in logistic regression and trying to get exactly the same results with the logistic regression module of scikit-learn. With the code below, I am able to get the coefficient and intercept, but I could not find a way to get the other properties of the model listed in the tutorial, such as the log-likelihood, Odds Ratio, Std. Err., z, P>|z|, and [95% Conf. Interval]. If someone could show me how to calculate them with the sklearn package, I would appreciate it.
import pandas as pd
from sklearn.linear_model import LogisticRegression
url = 'https://stats.idre.ucla.edu/wp-content/uploads/2016/02/sample.csv'
df = pd.read_csv(url, na_values=[''])
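y = df.hon.values                   # 1-D target array, the shape scikit-learn expects
X = df.math.values.reshape(-1, 1)   # 2-D feature matrix with a single column
clf = LogisticRegression(C=1e5)     # large C makes the default L2 penalty negligible
clf.fit(X, y)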
clf.coef_
clf.intercept_
Unless accompanied by a detailed description of the explanatory variables included in the model, odds ratios cannot be compared across different model specifications or across different study samples, for example, in meta‐analyses.
The odds ratio is calculated by dividing the odds of the first group by the odds in the second group. In the case of the worked example, it is the ratio of the odds of lung cancer in smokers divided by the odds of lung cancer in non-smokers: (647/622)/(2/27)=14.04.
Odds ratios and relative risks cannot be calculated from a linear regression. With logistic regression, the odds ratio for a predictor is obtained by exponentiating its coefficient: exp(Beta).
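To make the arithmetic in the worked example concrete, here is a minimal sketch in plain Python. The variable names are made up for illustration; the counts are the ones quoted above. It also shows the link to exp(Beta): for an unpenalized logistic regression with a single binary predictor, the fitted coefficient is the log of this odds ratio.

```python
import math

# 2x2 counts from the worked example above (variable names are illustrative).
smokers_cancer, smokers_no_cancer = 647, 622
nonsmokers_cancer, nonsmokers_no_cancer = 2, 27

# Odds of lung cancer within each group.
odds_smokers = smokers_cancer / smokers_no_cancer
odds_nonsmokers = nonsmokers_cancer / nonsmokers_no_cancer

# Odds ratio: odds in the first group divided by odds in the second.
odds_ratio = odds_smokers / odds_nonsmokers

# The corresponding logistic regression coefficient is the log odds ratio,
# so exponentiating it recovers the odds ratio.
beta = math.log(odds_ratio)

print(round(odds_ratio, 2))      # 14.04
print(round(math.exp(beta), 2))  # 14.04
```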
You can get the odds ratios by taking the exponential of the coefficients:
import numpy as np
X = df.female.values.reshape(200,1)
clf.fit(X,y)
np.exp(clf.coef_)
# array([[ 1.80891307]])
As for the other statistics, these are not easy to get from scikit-learn, where model evaluation is mostly done using cross-validation. If you need them, you're better off using a different library such as statsmodels.
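If you do want to stay in scikit-learn, the remaining quantities can be approximated by hand from the fitted model. The sketch below is not a scikit-learn feature. It uses the standard Wald approach: invert the Fisher information X'WX at the fitted coefficients to get standard errors, then derive z, p-values and 95% intervals. It assumes the penalty is negligible (large C) and uses synthetic stand-in data, so the numbers will not match the tutorial; swap in `df.math` and `df.hon` to compare.

```python
import math
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption, not the tutorial's dataset).
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 1))
true_logit = -1.0 + 0.8 * X[:, 0]
y = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

# Large C so the default L2 penalty has almost no effect.
clf = LogisticRegression(C=1e5, max_iter=1000)
clf.fit(X, y)

# Coefficient vector in the order [intercept, slope].
beta = np.concatenate([clf.intercept_, clf.coef_.ravel()])
design = np.column_stack([np.ones(len(X)), X])
p = clf.predict_proba(X)[:, 1]

# Log-likelihood of the fitted model.
log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Fisher information X' W X with W = diag(p * (1 - p)); its inverse
# approximates the covariance matrix of the coefficients.
info = design.T @ (design * (p * (1 - p))[:, None])
cov = np.linalg.inv(info)
std_err = np.sqrt(np.diag(cov))

# Wald z statistics, two-sided p-values, and 95% confidence intervals.
z = beta / std_err
p_values = np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])
ci_low = beta - 1.96 * std_err
ci_high = beta + 1.96 * std_err

# Odds ratios are the exponentiated coefficients.
odds_ratios = np.exp(beta)

print("log-likelihood:", log_lik)
print("coef:", beta, "std err:", std_err)
print("z:", z, "P>|z|:", p_values)
print("95% CI:", ci_low, ci_high)
print("odds ratios:", odds_ratios)
```

Exponentiating `ci_low` and `ci_high` gives the interval on the odds-ratio scale.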
In addition to @maxymoo's answer, statsmodels can be used to get the other statistics. Assuming that you have your data in a DataFrame called df, the code below should show a good summary:
import pandas as pd
from patsy import dmatrices
import statsmodels.api as sm
y, X = dmatrices( 'label ~ age + gender', data=df, return_type='dataframe')
mod = sm.Logit(y, X)
res = mod.fit()
print(res.summary())