
Confidence interval of probability prediction from logistic regression statsmodels

I'm trying to recreate a plot from An Introduction to Statistical Learning and I'm having trouble figuring out how to calculate the confidence interval for a probability prediction. Specifically, I'm trying to recreate the right-hand panel of this figure (figure 7.1) which is predicting the probability that wage>250 based on a degree 4 polynomial of age with associated 95% confidence intervals. The wage data is here if anyone cares.

I can predict and plot the predicted probabilities fine with the following code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.preprocessing import PolynomialFeatures

wage = pd.read_csv('../../data/Wage.csv', index_col=0)
wage['wage250'] = 0
wage.loc[wage['wage'] > 250, 'wage250'] = 1

poly = PolynomialFeatures(degree=4)
age = poly.fit_transform(wage['age'].values.reshape(-1, 1))

logit = sm.Logit(wage['wage250'], age).fit()

age_range_poly = poly.fit_transform(np.arange(18, 81).reshape(-1, 1))

y_proba = logit.predict(age_range_poly)

plt.plot(age_range_poly[:, 1], y_proba)

But I'm at a loss as to how the confidence intervals of the predicted probabilities are calculated. I have thought about bootstrapping the data many times to get the distribution of probabilities for each age but I know there is an easier way which is just beyond my grasp.

I have the estimated coefficient covariance matrix and the standard errors associated with each estimated coefficient. How would I go about calculating the confidence intervals as shown in the right-hand panel of the figure above given this information?

Thanks!

asked Nov 21 '17 by Taylor




1 Answer

You can use the delta method to find the approximate variance of the predicted probability. Namely,

var(proba) = np.dot(np.dot(gradient.T, cov), gradient)

where gradient is the vector of derivatives of the predicted probability with respect to the model coefficients, and cov is the covariance matrix of the coefficients.
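
For the logistic model specifically, proba = 1 / (1 + np.exp(-np.dot(x, beta))), and its derivative with respect to beta has the closed form proba * (1 - proba) * x; that is the gradient used in the code below. As a minimal self-contained sketch for a single observation (the function name is mine, not from statsmodels; x must include the intercept term):

import numpy as np

def delta_method_variance(x, beta, cov):
    # x: 1-D feature vector (including intercept), beta: fitted coefficients,
    # cov: coefficient covariance matrix, e.g. results.cov_params()
    p = 1.0 / (1.0 + np.exp(-x @ beta))  # predicted probability
    grad = p * (1.0 - p) * x             # dp/dbeta for the logistic link
    return grad @ cov @ grad             # gradient' * cov * gradient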

The delta method is known to work asymptotically for all maximum likelihood estimates. However, if you have a small training sample, asymptotic methods may not work well, and you should consider bootstrapping instead.

Here is a toy example of applying the delta method to logistic regression:

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# generate data
np.random.seed(1)
x = np.arange(100)
y = (x * 0.5 + np.random.normal(size=100, scale=10) > 30)

# estimate the model
X = sm.add_constant(x)
model = sm.Logit(y, X).fit()
proba = model.predict(X)  # predicted probabilities

# estimate confidence interval for predicted probabilities
cov = model.cov_params()
gradient = (proba * (1 - proba) * X.T).T  # matrix of gradients for each observation
std_errors = np.array([np.sqrt(np.dot(np.dot(g, cov), g)) for g in gradient])
c = 1.96  # multiplier for the 95% confidence interval
upper = np.maximum(0, np.minimum(1, proba + std_errors * c))  # clip to [0, 1]
lower = np.maximum(0, np.minimum(1, proba - std_errors * c))

plt.plot(x, proba)
plt.plot(x, lower, color='g')
plt.plot(x, upper, color='g')
plt.show()

It draws the following nice picture:

[plot: fitted probability curve with the delta-method 95% confidence band]
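
As an aside, the per-observation loop over gradients can be replaced by a single vectorized call; this is purely a performance tweak and yields the same numbers (np.asarray is a guard for the case where cov_params() returns a pandas DataFrame, which happens when the model is fit on pandas inputs):

std_errors = np.sqrt(np.einsum('ij,jk,ik->i', gradient, np.asarray(cov), gradient))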

For your example, the code would be:

proba = logit.predict(age_range_poly)
cov = logit.cov_params()
gradient = (proba * (1 - proba) * age_range_poly.T).T  # gradient of each predicted probability
std_errors = np.array([np.sqrt(np.dot(np.dot(g, cov), g)) for g in gradient])
c = 1.96  # 95% confidence multiplier
upper = np.maximum(0, np.minimum(1, proba + std_errors * c))
lower = np.maximum(0, np.minimum(1, proba - std_errors * c))

plt.plot(age_range_poly[:, 1], proba)
plt.plot(age_range_poly[:, 1], lower, color='g')
plt.plot(age_range_poly[:, 1], upper, color='g')
plt.show()

and it would give the following picture:

[plot: predicted probability that wage > 250 by age, with the delta-method confidence band]

Looks pretty much like a boa-constrictor with an elephant inside.

You could compare it with the bootstrap estimates:

preds = []
for i in range(1000):
    # resample observations with replacement and refit the model
    boot_idx = np.random.choice(len(age), replace=True, size=len(age))
    model = sm.Logit(wage['wage250'].iloc[boot_idx], age[boot_idx]).fit(disp=0)
    preds.append(model.predict(age_range_poly))
p = np.array(preds)
# the 2.5th and 97.5th percentiles of the bootstrap predictions form a 95% band
plt.plot(age_range_poly[:, 1], np.percentile(p, 97.5, axis=0))
plt.plot(age_range_poly[:, 1], np.percentile(p, 2.5, axis=0))
plt.show()

[plot: bootstrap 95% confidence band]

The results of the delta method and the bootstrap look pretty much the same.

The authors of the book, however, take a third way. They use the fact that

proba = np.exp(np.dot(x, params)) / (1 + np.exp(np.dot(x, params)))

and calculate the confidence interval for the linear part first, then map it back to the probability scale with the logistic (inverse logit) function.
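
On the logit scale the delta method is actually exact, because xb = np.dot(x, params) is linear in the coefficients: its gradient with respect to params is just x itself, so for each row x of age_range_poly the variance of the linear predictor is simply

var(xb) = np.dot(np.dot(x, cov), x)

which is what the standard-error line below computes. The 95% interval xb ± 1.96 * np.sqrt(var(xb)) is then pushed through the logistic function, which maps both endpoints into (0, 1), so no clipping is needed: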

xb = np.dot(age_range_poly, logit.params)  # linear predictor (log-odds)
std_errors = np.array([np.sqrt(np.dot(np.dot(g, cov), g)) for g in age_range_poly])
upper_xb = xb + c * std_errors
lower_xb = xb - c * std_errors
upper = np.exp(upper_xb) / (1 + np.exp(upper_xb))  # logistic transform back to probabilities
lower = np.exp(lower_xb) / (1 + np.exp(lower_xb))
plt.plot(age_range_poly[:, 1], upper)
plt.plot(age_range_poly[:, 1], lower)
plt.show()

So they get the diverging interval:

[plot: confidence band computed on the logit scale, widening toward the boundaries]

These methods produce such different results because they assume different quantities (the predicted probability in one case, the log-odds in the other) to be normally distributed. Namely, the delta method as applied above assumes the predicted probabilities are normal, while in the book it is the log-odds that are assumed normal. In fact, neither is normal in finite samples; both converge to normality as the sample grows, but their variances converge to zero at the same time. Maximum likelihood estimates are invariant to reparametrization, but their estimated distributions are not, and that is the problem.
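
To see the practical difference at a single point, you can compute both intervals for one observation (a sketch reusing model, X and cov from the toy example above; index 80 is an arbitrary observation whose fitted probability is close to 1 in that fit):

i = 80
x_i = X[i]
xb_i = x_i @ model.params                      # linear predictor (log-odds)
p_i = 1 / (1 + np.exp(-xb_i))                  # fitted probability
se_xb = np.sqrt(x_i @ np.asarray(cov) @ x_i)   # s.e. of the linear predictor
se_p = p_i * (1 - p_i) * se_xb                 # delta-method s.e. on the probability scale

# probability-scale interval: symmetric, may spill outside [0, 1]
print(p_i - 1.96 * se_p, p_i + 1.96 * se_p)
# logit-scale interval: asymmetric, always inside (0, 1)
print(1 / (1 + np.exp(-(xb_i - 1.96 * se_xb))),
      1 / (1 + np.exp(-(xb_i + 1.96 * se_xb))))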

answered Sep 23 '22 by David Dale