How to get odds-ratios and other related features with scikit-learn

I'm going through this tutorial on odds ratios in logistic regression and trying to reproduce exactly the same results with the logistic regression module of scikit-learn. With the code below I can get the coefficient and intercept, but I could not find a way to get the other model properties listed in the tutorial, such as the log-likelihood, Odds Ratio, Std. Err., z, P>|z|, and [95% Conf. Interval]. If someone could show me how to calculate them with the sklearn package, I would appreciate it.

import pandas as pd
from sklearn.linear_model import LogisticRegression

url = 'https://stats.idre.ucla.edu/wp-content/uploads/2016/02/sample.csv'
df = pd.read_csv(url, na_values=[''])
y = df.hon.values                   # 1-D target; a column vector triggers a DataConversionWarning
X = df.math.values.reshape(-1, 1)   # 2-D feature matrix with a single column
clf = LogisticRegression(C=1e5)
clf.fit(X,y)
clf.coef_
clf.intercept_

asked Sep 21 '16 20:09 by Erdem KAYA



2 Answers

You can get the odds ratios by exponentiating the coefficients:

import numpy as np
X = df.female.values.reshape(200,1)
clf.fit(X,y)
np.exp(clf.coef_)

# array([[ 1.80891307]])

As for the other statistics, they are not easy to get from scikit-learn, where model evaluation is mostly done using cross-validation. If you need them, you're better off using a different library such as statsmodels.
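If staying with scikit-learn is a requirement, the remaining columns can be approximated by hand from the fitted coefficients. The following is only a sketch under stated assumptions: the fit is effectively unregularized (a large C, as in the question), the standard Wald formulas apply, and the covariance matrix is taken as the inverse of X'WX with W = diag(p(1 - p)). The function name wald_summary and the toy data are made up for illustration. So that the block runs without scikit-learn, the coefficients come from a few Newton steps in NumPy; with a fitted clf you would pass beta = np.r_[clf.intercept_, clf.coef_[0]] instead.

```python
import math
import numpy as np


def wald_summary(X, y, beta):
    """Wald statistics for a logistic model; beta = [intercept, slope, ...]."""
    X1 = np.column_stack([np.ones(len(X)), X])      # prepend intercept column
    p = 1.0 / (1.0 + np.exp(-X1 @ beta))            # fitted probabilities
    loglik = float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))
    info = X1.T @ (X1 * (p * (1 - p))[:, None])     # Fisher information X'WX
    se = np.sqrt(np.diag(np.linalg.inv(info)))      # std. errors (log-odds scale)
    z = beta / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|))
    p_value = np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])
    # 95% interval on the coefficient, exponentiated to the odds-ratio scale
    ci = np.exp(np.column_stack([beta - 1.96 * se, beta + 1.96 * se]))
    return {
        "log_likelihood": loglik,
        "odds_ratio": np.exp(beta),
        "se": se,
        "z": z,
        "p_value": p_value,
        "ci": ci,
    }


# Toy data (hypothetical, not the tutorial's sample.csv)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-(0.5 + X[:, 0])))).astype(float)

# Stand-in for an unregularized fit: Newton-Raphson on the log-likelihood
X1 = np.column_stack([np.ones(len(X)), X])
beta = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X1 @ beta))
    beta += np.linalg.solve(X1.T @ (X1 * (p * (1 - p))[:, None]), X1.T @ (y - p))

stats = wald_summary(X, y, beta)
for name, value in stats.items():
    print(name, value)
```

The standard errors here are on the coefficient (log-odds) scale, and with any meaningful regularization these formulas no longer describe the fitted model, which is one more reason to prefer statsmodels for this kind of output.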

answered Sep 23 '22 18:09 by maxymoo


In addition to @maxymoo's answer: to get the other statistics, statsmodels can be used. Assuming that you have your data in a DataFrame called df, the code below should print a good summary (label, age and gender are placeholder column names; substitute your own):

import pandas as pd
from patsy import dmatrices
import statsmodels.api as sm 

y, X = dmatrices( 'label ~ age + gender', data=df, return_type='dataframe')
mod = sm.Logit(y, X)
res = mod.fit()
print(res.summary())

answered Sep 22 '22 18:09 by Erdem KAYA