Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Drawing regression line, confidence interval, and prediction interval in Python

I'm new to the regression game and hope to plot a functionally arbitrary, nonlinear regression line (plus confidence and prediction intervals) for a subset of data that satisfies a certain condition (i.e. with mean replicate value exceeding a threshold; see below).

The data is generated for independent variable x across 20 different values: x=(20-np.arange(20))**2, with rep_num=10 replicates for each condition. The data shows strong nonlinearity across x and looks like the following:

import numpy as np

mu = [.40, .38, .39, .35, .37, .33, .34, .28, .11, .24,
      .03, .07, .01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]     

data = np.zeros((20, rep_num))
for i in range(13):
    data[i] = np.clip(np.random.normal(loc=mu[i], scale=0.1, size=rep_num), 0., 1.)

I can make a scatter plot of the data; the replicate means are shown by the red dots:

import matplotlib.pyplot as plt

plt.scatter(np.log10(np.tile(x[:,None], rep_num)), data, 
            facecolors='none', edgecolors='k', alpha=0.25)
plt.plot(np.log10(x), data.mean(1), 'ro', alpha=0.8)
plt.plot(np.log10(x), np.repeat(0., 20), 'k--')
plt.xlim(-0.02, np.max(np.log10(x)) + 0.02)
plt.ylim(-0.01, 0.7)

scatter plot

My goal is to plot a regression line for only those data that have replicate mean > 0.02. In addition, I would like to add a 95% confidence interval (black dashed lines) around the regression, as well as a 95% prediction interval (blue dashed lines) -- ideally, the prediction interval can also be colored in with transparent blue background.

The final plot (without the blue background inside the prediction interval) would look something like this:

enter image description here

How would I make this? My online search yielded very different partial approaches using seaborn, scipy, and statsmodels. The applications of some of those template functions did not appear to work alongside the existing matplotlib scatter plot.

like image 841
neither-nor Avatar asked Aug 27 '17 22:08

neither-nor


People also ask

What is 95% confidence interval of a regression line?

The 95% confidence interval is commonly interpreted as there is a 95% probability that the true linear regression line of the population will lie within the confidence interval of the regression line calculated from the sample data.

What is the difference between the confidence interval and prediction interval in regression analysis?

The prediction interval predicts in what range a future individual observation will fall, while a confidence interval shows the likely range of values associated with some statistical parameter of the data, such as the population mean.


1 Answers

OK, here's a shot at this (withouth prediction band, though). First of all you want to select the applicable data:

threshold = 0.02
reg_x = np.log10(x)[data.mean(1)>threshold]
reg_y = data.mean(1)[data.mean(1)>threshold]

Then you choose a model and perform a fit. Note, here I chose a second order polynomial but in principle you could do anything. For the fits I use kapteyn, this has a built-in confidence bans method, although it would be straightforward to implement (see e.g. Delta method)

from kapteyn import kmpfit

# Set model to fit.
def model(p, x):
    a, b, c = p
    return a + b*x + c*x**2

# Perform fit.
f = kmpfit.simplefit(model, [.1, .1, .1], reg_x, reg_y)

f contains all the estimated parameters and such, you can use that for plotting etc.

x = np.linspace(0, 3, 100)
plt.plot(x, model(f.params, x), linestyle='-', color='black', marker='')

For the confidence bands, we need the partial derivatives of the model with respect to the parameters (yes, some math). Again, this is easy for a polynomial model, shouldn't be a problem for any other model either.

# Partial derivatives:
dfdp = [1., reg_x, reg_x**2]
_, ci_upper, ci_lower = f.confidence_band(reg_x, dfdp, 0.95, model)

# Plot.
plt.plot(reg_x, ci_upper, linestyle='--', color='black', marker='')
plt.plot(reg_x, ci_lower, linestyle='--', color='black', marker='')

Unfortunately there is not prediction_bands() routine in the package, at least not that I know of. Assume you found some method for the prediction band, the plotting and preparation would look the same though..

p_upper, p_lower = prediction_band(*args, **kwargs)
plt.fill_between(reg_x, p_upper, p_lower, facecolor='blue', alpha=0.2, linestyle='')

Hope this helps, L.

like image 91
rammelmueller Avatar answered Sep 23 '22 22:09

rammelmueller