We can create the ECDF with
import numpy as np
from statsmodels.distributions.empirical_distribution import ECDF
ecdf = ECDF([3, 3, 1, 4])
and then evaluate the ECDF at a point x with
ecdf(x)
However, what if I want to know the x at the 97.5th percentile?
According to http://www.statsmodels.org/stable/generated/statsmodels.distributions.empirical_distribution.ECDF.html?highlight=ecdf, this does not seem to be implemented.
Is there any way to do this? Or are there other libraries that can?
An empirical distribution function provides a way to model and sample cumulative probabilities for a data sample that does not fit a standard probability distribution. As such, it is sometimes called the empirical cumulative distribution function, or ECDF for short.
To plot the ECDF, we first need to compute the cumulative values. One option is the dc_stat_think package, imported as dcst: calling its ecdf() function on the data returns the sorted values and the corresponding cumulative proportions, which we can store in x and y. We can then plot them with matplotlib's pyplot module (plt).
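If dc_stat_think is not available, the same x and y values can be computed with NumPy alone. The following is a minimal sketch; the helper name ecdf_points is made up for this illustration and is not part of any library.

```python
import numpy as np

def ecdf_points(data):
    """Return the sorted values x and the ECDF heights y = i/n (hypothetical helper)."""
    x = np.sort(np.asarray(data, dtype=float))
    n = x.size
    # The i-th smallest value (1-based) has cumulative proportion i/n.
    y = np.arange(1, n + 1) / n
    return x, y

x, y = ecdf_points([3, 3, 1, 4])
print(x)
print(y)
```

To draw the step function, something like plt.step(x, y, where="post") from matplotlib.pyplot should work on these arrays.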
The exponential distribution has probability density $f(x) = e^{-x}$, $x \ge 0$, and therefore the cumulative distribution is the integral of the density: $F(x) = 1 - e^{-x}$. This function can be explicitly inverted by solving for x in the equation $F(x) = u$. The inverse CDF is $x = -\log(1 - u)$.
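As a rough illustration of that inverse CDF, the sketch below feeds uniform random numbers through x = -log(1 - u) to draw exponential samples. The function name exp_inverse_cdf, the seed, and the sample size are arbitrary choices for this example.

```python
import numpy as np

def exp_inverse_cdf(u):
    """Inverse CDF of the unit-rate exponential: x = -log(1 - u)."""
    # log1p(-u) computes log(1 - u) accurately for small u.
    return -np.log1p(-u)

rng = np.random.default_rng(0)       # fixed seed, chosen arbitrarily
u = rng.uniform(size=100_000)        # uniform draws in [0, 1)
samples = exp_inverse_cdf(u)

# The sample mean should be close to 1, the mean of this distribution.
print(samples.mean())
```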
In statistics, an empirical distribution function (commonly also called an empirical Cumulative Distribution Function, eCDF) is the distribution function associated with the empirical measure of a sample. This cumulative distribution function is a step function that jumps up by 1/n at each of the n data points.
Since the empirical CDF just places a mass of 1/n at each data point, the 97.5% quantile is the smallest data point that is at least as large as 97.5% of the sample. To find this value, sort the data in ascending order and take the value at position 0.975n, rounded up (counting positions from 1).
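import math
sample = [1, 5, 2, 10, -19, 4, 7, 2, 0, -1]
n = len(sample)
ordered = sorted(sample)
# ceil(0.975 * n) is the 1-based rank; subtract 1 for Python's 0-based indexing
print(ordered[math.ceil(n * 0.975) - 1])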
Which produces:
10
Recall that for discrete distributions (such as the empirical CDF), the quantile function at p is defined as the smallest x with F(x) ≥ p. It follows that we have to take the value at position 0.975n, rounded up, in the sorted sample.
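The rounded-up rule can be wrapped in a small general-purpose function. This is only a sketch: the name ecdf_quantile and the 1e-12 tolerance (which guards against floating-point products such as 0.7 * 10 coming out slightly above 7) are assumptions made for this example.

```python
import math

def ecdf_quantile(sample, p):
    """Smallest sample value x whose ECDF satisfies F(x) >= p, for 0 < p <= 1."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    ordered = sorted(sample)
    n = len(ordered)
    # 1-based rank ceil(n * p); the small tolerance absorbs rounding error.
    rank = max(1, math.ceil(n * p - 1e-12))
    return ordered[rank - 1]

sample = [1, 5, 2, 10, -19, 4, 7, 2, 0, -1]
print(ecdf_quantile(sample, 0.975))  # 10
print(ecdf_quantile(sample, 0.5))    # 2
```

Unlike indexing with int(n * p), this stays on the correct step when n * p is a whole number, as in the p = 0.5 case.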
This is my suggestion: use linear interpolation, since distribution functions can only be estimated well from fairly large samples anyway. The interpolating line segments can be obtained because their endpoints occur at the distinct values in the sample.
import statsmodels.distributions.empirical_distribution as edf
from scipy.interpolate import interp1d
import numpy as np
import matplotlib.pyplot as plt
sample = [1,4,2,6,5,5,3,3,5,7]
sample_edf = edf.ECDF(sample)
slope_changes = sorted(set(sample))
sample_edf_values_at_slope_changes = [ sample_edf(item) for item in slope_changes]
inverted_edf = interp1d(sample_edf_values_at_slope_changes, slope_changes)
x = np.linspace(0.1, 1)
y = inverted_edf(x)
plt.plot(x, y, 'ro', x, y, 'b-')
plt.show()
print ('97.5 percentile:', inverted_edf(0.975))
It produces the following output,
97.5 percentile: 6.75
and this graph.
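numpy.quantile(x, q=.975)
returns the 0.975 quantile of the array x. Note that by default it interpolates linearly between neighbouring data points rather than returning the exact ECDF step.
Similarly, there is pandas.Series.quantile(q=0.975) (and pandas.DataFrame.quantile)
for Series/DataFrames.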
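To see the difference between the default interpolation and the exact inverse of the ECDF, the sketch below compares the two on a small sample. It assumes NumPy 1.22 or newer, where numpy.quantile accepts the method argument.

```python
import numpy as np

sample = [1, 5, 2, 10, -19, 4, 7, 2, 0, -1]

# Default method: linear interpolation between neighbouring order statistics.
q_linear = np.quantile(sample, 0.975)

# method="inverted_cdf" returns an actual data point, matching the ECDF step.
q_step = np.quantile(sample, 0.975, method="inverted_cdf")

print(q_linear)  # about 9.325, between the two largest values 7 and 10
print(q_step)    # the largest value, 10
```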