
Python: inverse empirical cumulative distribution function (ECDF)?

We can create the ECDF with

import numpy as np
from statsmodels.distributions.empirical_distribution import ECDF
ecdf = ECDF([3, 3, 1, 4])

and then evaluate the ECDF at a point x with

ecdf(x)

However, what if I want to know the x that corresponds to the 97.5th percentile?

From http://www.statsmodels.org/stable/generated/statsmodels.distributions.empirical_distribution.ECDF.html?highlight=ecdf, it seems this has not been implemented.

Is there any way to do this? Or any other libraries?

cqcn1991 asked May 23 '17 10:05



3 Answers

Since the empirical CDF just places a mass of 1/n at each data point, the 97.5th percentile is the smallest data point that is greater than or equal to 97.5% of the sample. To find this value, sort the data in ascending order and take the value at position 0.975n (rounded up), counting from the smallest.

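import math

sample = [1, 5, 2, 10, -19, 4, 7, 2, 0, -1]
n = len(sample)
sorted_sample = sorted(sample)
# 1-based rank ceil(0.975 * n), shifted by one for 0-based indexing
print(sorted_sample[math.ceil(n * 0.975) - 1])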

Which produces:

10

Recall that for discrete distributions (like the empirical CDF), the quantile function is defined as the smallest x for which F(x) ≥ p. That is why we take the value at position 0.975n, rounded up, in the sorted sample.
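The rule above can be wrapped in a small helper. This is only a sketch: the name inverse_ecdf and the small tolerance used to absorb floating-point error in p * n are illustrative assumptions, not part of any library.

```python
import math

def inverse_ecdf(sample, p):
    """Smallest sample value x whose ECDF satisfies F(x) >= p, for 0 < p <= 1.

    Hypothetical helper, written for illustration only.
    """
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    ordered = sorted(sample)
    n = len(ordered)
    # 1-based rank; the tiny tolerance keeps products such as 0.7 * 10,
    # which evaluate slightly above 7 in floating point, from rounding up to 8.
    rank = max(1, math.ceil(p * n - 1e-9))
    return ordered[rank - 1]

print(inverse_ecdf([1, 5, 2, 10, -19, 4, 7, 2, 0, -1], 0.975))  # 10
print(inverse_ecdf([3, 3, 1, 4], 0.5))                          # 3
```

Unlike an interpolating approach, this always returns a value that actually occurs in the sample.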

Benjamin Doughty answered Oct 20 '22 22:10


This is my suggestion. I use linear interpolation, because distribution functions can only be estimated well from fairly large samples anyway. The interpolating line segments can be obtained because their endpoints occur at the distinct values in the sample.

import statsmodels.distributions.empirical_distribution as edf
from scipy.interpolate import interp1d
import numpy as np
import matplotlib.pyplot as plt

sample = [1,4,2,6,5,5,3,3,5,7]
sample_edf = edf.ECDF(sample)

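# Distinct sample values: the points where the ECDF jumps
slope_changes = sorted(set(sample))

# Evaluate the ECDF at each distinct value, then interpolate with the axes swapped
sample_edf_values_at_slope_changes = [sample_edf(item) for item in slope_changes]
inverted_edf = interp1d(sample_edf_values_at_slope_changes, slope_changes)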

x = np.linspace(0.1, 1)
y = inverted_edf(x)
plt.plot(x, y, 'ro', x, y, 'b-')
plt.show()

print ('97.5 percentile:', inverted_edf(0.975))

It produces the following output,

97.5 percentile: 6.75

and a graph of the inverted empirical CDF.
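If statsmodels and scipy are not available, roughly the same interpolation can be sketched with NumPy alone. This is an alternative under stated assumptions rather than the code above: it builds the ECDF values from np.unique counts, and np.interp clamps probabilities below the smallest ECDF value to the sample minimum, whereas interp1d raises an error by default.

```python
import numpy as np

sample = [1, 4, 2, 6, 5, 5, 3, 3, 5, 7]

# Distinct values (sorted) and how often each occurs
values, counts = np.unique(sample, return_counts=True)

# ECDF evaluated at each distinct value
ecdf_at_values = np.cumsum(counts) / len(sample)

# Swap the axes and interpolate linearly to invert the ECDF
p975 = np.interp(0.975, ecdf_at_values, values)
print('97.5 percentile:', p975)
```

For this sample the result should agree with the 6.75 reported above, up to floating-point rounding.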

Bill Bell answered Oct 20 '22 22:10


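numpy.quantile(x, q=0.975) returns the 0.975 quantile of the array x, that is, the value at which the ECDF reaches 0.975. By default it interpolates linearly between neighboring data points.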

Similarly, pandas provides Series.quantile(q=0.975) and DataFrame.quantile(q=0.975).
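A short sketch of how this behaves in practice, using an arbitrary example sample. With default settings, numpy.quantile interpolates linearly between order statistics, so the result need not be a value from the sample and can differ from the step-function inverse of the ECDF. Recent NumPy versions (1.22 and later, as far as I know) also accept a method argument such as "inverted_cdf" for the step-function definition; it is not used here so that the example runs on older versions too.

```python
import math
import numpy as np

sample = np.array([1, 4, 2, 6, 5, 5, 3, 3, 5, 7])

# Default behavior: linear interpolation between neighboring order statistics
q_default = np.quantile(sample, 0.975)

# np.percentile is the same computation with q given in percent
p_default = np.percentile(sample, 97.5)

# Step-function inverse of the ECDF, for comparison: the ceil(p * n)-th smallest value
q_step = np.sort(sample)[math.ceil(0.975 * len(sample)) - 1]

print(q_default, p_default, q_step)
```

Here the interpolated quantile lies between 6 and 7, while the step-function inverse returns the sample maximum, 7.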

mathsmodel answered Oct 20 '22 22:10