My question involves statistics and python and I am a beginner in both. I am running a simulation, and for each value for the independent variable (X) I produce 1000 values for the dependent variable (Y). What I have done is that I calculated the average of Y for each value of X and fitted these averages using scipy.optimize.curve_fit. The curve fits nicely, but I want to draw also the confidence intervals. I am not sure if what I am doing is correct or if what I want to do can be done, but my question is how can I get the confidence intervals from the covariance matrix produced by curve_fit. The code reads the averages from files first then it just simply uses curve_fit.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
def readTDvsTx(L, B, P, fileformat):
# L should be '_Fixed_' or '_'
TD = []
infile = open(fileformat.format(L, B, P), 'r')
infile.readline() # To remove header
for line in infile:
l = line.split() # each line contains TxR followed by CD followed by TD
if eval(l[0]) >= 70 and eval(l[0]) <=190:
td = eval(l[2])
TD.append(td)
infile.close()
tdArray = np.array(TD)
return tdArray
def rec(x, a, b):
return a * (1 / (x**2)) + b
fileformat = 'Densities_file{}BS{}_PRNTS{}.txt'
txR = np.array(range(70, 200, 20))
parents = np.array(range(1,6))
disc_p1 = readTDvsTx('_Fixed_', 5, 1, fileformat)
popt, pcov = curve_fit(rec, txR, disc_p1)
plt.plot(txR, rec(txR, popt[0], popt[1]), 'r-')
plt.plot(txR, disc_p1, '.')
print(popt)
plt.show()
And here is the resulting fit:
For a 95% confidence interval, we use z=1.96, while for a 90% confidence interval, for example, we use z=1.64. Pr(−z<Z<z)=C100,whe re Zd=N(0,1).
A confidence interval is an interval that contains the population parameter with probability 1−α .
Here's a quick and wrong answer: you can approximate the errors from the covariance matrix for your a
and b
parameters as the square root of its diagonals: np.sqrt(np.diagonal(pcov))
. The parameter uncertainties can then be used to draw the confidence intervals.
The answer is wrong because you before you fit your data to a model, you'll need an estimate of the errors on your averaged disc_p1
points. When averaging, you have lost the information about the scatter of the population, leading curve_fit
to believe that the y-points you feed it are absolute and undisputable. This might cause an underestimation of your parameter errors.
For an estimate of the uncertainties of your averaged Y values, you need to estimate their dispersion measure and pass it along to curve_fit
while saying that your errors are absolute. Below is an example of how to do this for a random dataset where each of your points consists of a 1000 samples drawn from a normal distribution.
from scipy.optimize import curve_fit
import matplotlib.pylab as plt
import numpy as np
# model function
func = lambda x, a, b: a * (1 / (x**2)) + b
# approximating OP points
n_ypoints = 7
x_data = np.linspace(70, 190, n_ypoints)
# approximating the original scatter in Y-data
n_nested_points = 1000
point_errors = 50
y_data = [func(x, 4e6, -100) + np.random.normal(x, point_errors,
n_nested_points) for x in x_data]
# averages and dispersion of data
y_means = np.array(y_data).mean(axis = 1)
y_spread = np.array(y_data).std(axis = 1)
best_fit_ab, covar = curve_fit(func, x_data, y_means,
sigma = y_spread,
absolute_sigma = True)
sigma_ab = np.sqrt(np.diagonal(covar))
from uncertainties import ufloat
a = ufloat(best_fit_ab[0], sigma_ab[0])
b = ufloat(best_fit_ab[1], sigma_ab[1])
text_res = "Best fit parameters:\na = {}\nb = {}".format(a, b)
print(text_res)
# plotting the unaveraged data
flier_kwargs = dict(marker = 'o', markerfacecolor = 'silver',
markersize = 3, alpha=0.7)
line_kwargs = dict(color = 'k', linewidth = 1)
bp = plt.boxplot(y_data, positions = x_data,
capprops = line_kwargs,
boxprops = line_kwargs,
whiskerprops = line_kwargs,
medianprops = line_kwargs,
flierprops = flier_kwargs,
widths = 5,
manage_ticks = False)
# plotting the averaged data with calculated dispersion
#plt.scatter(x_data, y_means, facecolor = 'silver', alpha = 1)
#plt.errorbar(x_data, y_means, y_spread, fmt = 'none', ecolor = 'black')
# plotting the model
hires_x = np.linspace(50, 190, 100)
plt.plot(hires_x, func(hires_x, *best_fit_ab), 'black')
bound_upper = func(hires_x, *(best_fit_ab + sigma_ab))
bound_lower = func(hires_x, *(best_fit_ab - sigma_ab))
# plotting the confidence intervals
plt.fill_between(hires_x, bound_lower, bound_upper,
color = 'black', alpha = 0.15)
plt.text(140, 800, text_res)
plt.xlim(40, 200)
plt.ylim(0, 1000)
plt.show()
Edit: If you are not considering the intrinsic errors on the data points, you are probably fine with using the "qiuck and wrong" case I mentioned before. The square root of the diagonal entries of covariance matrix can then be used to calculate your confidence intervals. However, note that the confidence intervals have shrunk now that we've dropped the uncertainties:
from scipy.optimize import curve_fit
import matplotlib.pylab as plt
import numpy as np
func = lambda x, a, b: a * (1 / (x**2)) + b
n_ypoints = 7
x_data = np.linspace(70, 190, n_ypoints)
y_data = np.array([786.31, 487.27, 341.78, 265.49,
224.76, 208.04, 200.22])
best_fit_ab, covar = curve_fit(func, x_data, y_data)
sigma_ab = np.sqrt(np.diagonal(covar))
# an easy way to properly format parameter errors
from uncertainties import ufloat
a = ufloat(best_fit_ab[0], sigma_ab[0])
b = ufloat(best_fit_ab[1], sigma_ab[1])
text_res = "Best fit parameters:\na = {}\nb = {}".format(a, b)
print(text_res)
plt.scatter(x_data, y_data, facecolor = 'silver',
edgecolor = 'k', s = 10, alpha = 1)
# plotting the model
hires_x = np.linspace(50, 200, 100)
plt.plot(hires_x, func(hires_x, *best_fit_ab), 'black')
bound_upper = func(hires_x, *(best_fit_ab + sigma_ab))
bound_lower = func(hires_x, *(best_fit_ab - sigma_ab))
# plotting the confidence intervals
plt.fill_between(hires_x, bound_lower, bound_upper,
color = 'black', alpha = 0.15)
plt.text(140, 630, text_res)
plt.xlim(60, 200)
plt.ylim(0, 800)
plt.show()
If you're unsure whether to include the absolute errors or how to estimate them in your case, you'd be better off asking for advice at Cross Validated, as Stack Overflow is mainly for discussion on implementations of regression methods and not for discussion on the underlying statistics.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With