How to plot empirical cdf (ecdf)

If you like linspace and prefer one-liners, you can do:

plt.plot(np.sort(a), np.linspace(0, 1, len(a), endpoint=False))

Given my tastes, I almost always do:

import numpy as np
import matplotlib.pyplot as plt

# a is the data array
x = np.sort(a)
y = np.arange(len(x)) / float(len(x))
plt.plot(x, y)

This works for me even with more than 1e6 data values. If you really need to downsample, I'd set

x = np.sort(a)[::down_sampling_step]
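
A minimal sketch of the downsampled version (the value of down_sampling_step is illustrative, and the random data stands in for your array a); note that the y values must be sliced with the same step so both arrays keep the same length:

import numpy as np
import matplotlib.pyplot as plt

down_sampling_step = 100                                  # illustrative value
a = np.random.randn(10**6)                                # example data (assumed)
x = np.sort(a)[::down_sampling_step]
y = (np.arange(len(a)) / float(len(a)))[::down_sampling_step]
plt.plot(x, y)
plt.show()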

Edit: to respond to the comment asking why I use endpoint=False, and y as defined above, here are some technical details.

The empirical CDF is usually formally defined as

CDF(x) = "number of samples <= x"/"number of samples"

in order to exactly match this formal definition, you would need to use y = np.arange(1, len(x)+1)/float(len(x)) so that we get y = [1/N, 2/N ... 1]. This is an unbiased estimator that will converge to the true CDF in the limit of infinite samples (see the Wikipedia article on the empirical distribution function).

I tend to use y = [0, 1/N, 2/N ... (N-1)/N] since (a) it is easier to code and more idiomatic, (b) it is still formally justified, since one can always exchange CDF(x) with 1-CDF(x) in the convergence proof, and (c) it works with the (easy) downsampling method described above.

In some particular cases, it is useful to define

y = (np.arange(len(x)) + 0.5)/len(x)

which is intermediate between these two conventions. In effect, it says "there is a 1/(2N) chance of a value less than the lowest one I've seen in my sample, and a 1/(2N) chance of a value greater than the largest one I've seen so far."

Note that the choice of convention interacts with the where parameter used in plt.step, if it seems more useful to display the CDF as a piecewise-constant function. In order to exactly match the formal definition mentioned above, one would need to use where='pre' with the suggested y = [0, 1/N, ..., (N-1)/N] convention, or where='post' with the y = [1/N, 2/N, ..., 1] convention, but not the other way around.
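
As a concrete (hedged) illustration of that pairing, with randomly generated data standing in for the real sample, both of the following calls draw the same piecewise-constant ECDF curve:

import numpy as np
import matplotlib.pyplot as plt

a = np.random.randn(100)     # example data (assumed)
x = np.sort(a)
N = len(x)

# where='post' pairs with y = [1/N, ..., 1]: each value extends rightward
plt.step(x, np.arange(1, N + 1) / N, where='post')

# where='pre' pairs with y = [0, ..., (N-1)/N]: each value extends leftward
plt.step(x, np.arange(N) / N, where='pre')

plt.show()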

However, for large samples and reasonable distributions, the convention given in the main body of the answer is easy to write, is an unbiased estimator of the true CDF, and works with the downsampling methodology described above.


You can use the ECDF function from the scikits.statsmodels library:

import numpy as np
import scikits.statsmodels as sm
import matplotlib.pyplot as plt

sample = np.random.uniform(0, 1, 50)
ecdf = sm.tools.ECDF(sample)

x = np.linspace(min(sample), max(sample))
y = ecdf(x)
plt.step(x, y)

With version 0.4, scikits.statsmodels was renamed to statsmodels. ECDF is now located in the distributions module (while statsmodels.tools.tools.ECDF is deprecated).

import numpy as np
import statsmodels.api as sm # recommended import according to the docs
import matplotlib.pyplot as plt

sample = np.random.uniform(0, 1, 50)
ecdf = sm.distributions.ECDF(sample)

x = np.linspace(min(sample), max(sample))
y = ecdf(x)
plt.step(x, y)
plt.show()

That (scipy.stats.cumfreq) looks to be (almost) exactly what you want. Two things:

First, the result is a tuple of four items. The third is the size of the bins. The second is the starting point of the smallest bin. The first is the number of points in or below each bin. (The last is the number of points outside the limits, but since you haven't set any, all points will be binned.)

Second, you'll want to rescale the results so the final value is 1, to follow the usual conventions of a CDF, but otherwise it's right.

Here's what it does under the hood:

def cumfreq(a, numbins=10, defaultreallimits=None):
    # docstring omitted
    h, l, b, e = histogram(a, numbins, defaultreallimits)
    cumhist = np.cumsum(h * 1, axis=0)
    return cumhist, l, b, e

It does the histogramming, then produces a cumulative sum of the counts in each bin. So the ith value of the result is the number of array values less than or equal to the maximum of the ith bin, and the final value is just the size of the initial array.

Finally, to plot it, you'll need to use the initial value of the bin, and the bin size to determine what x-axis values you'll need.
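
Putting that together, here is a minimal sketch using the current scipy.stats.cumfreq, which returns a named tuple with cumcount, lowerlimit, binsize, and extrapoints fields (the sample data is illustrative):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

a = np.random.randn(500)     # example data (assumed)
res = stats.cumfreq(a, numbins=20)

# x positions: the upper edge of each bin, from the lower limit and bin size
x = res.lowerlimit + res.binsize * np.arange(1, res.cumcount.size + 1)

# rescale so the final value is 1, as described above
plt.plot(x, res.cumcount / res.cumcount[-1])
plt.show()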

Another option is to use numpy.histogram, which can do the normalization and returns the bin edges. You'll need to do the cumulative sum of the resulting counts yourself. (The old normed=True argument has been replaced by density=True in modern NumPy; the density values must then be multiplied by the bin widths so the cumulative sum integrates up to a proper CDF.)

a = numpy.array([...])  # your array of numbers
num_bins = 20
counts, bin_edges = numpy.histogram(a, bins=num_bins, density=True)
cdf = numpy.cumsum(counts * numpy.diff(bin_edges))  # integrate the density over each bin
pylab.plot(bin_edges[1:], cdf)

(bin_edges[1:] is the upper edge of each bin.)


Have you tried the cumulative=True argument to pyplot.hist?
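
For example, a minimal sketch with made-up data: density=True makes the cumulative curve end at 1, and histtype='step' draws an outline rather than filled bars.

import numpy as np
import matplotlib.pyplot as plt

a = np.random.randn(1000)    # example data (assumed)
plt.hist(a, bins=100, cumulative=True, density=True, histtype='step')
plt.show()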


One-liner based on Dave's answer:

plt.plot(np.sort(arr), np.linspace(0, 1, len(arr), endpoint=False))

Edit: this was also suggested by hans_meine in the comments.