Calculate the Cumulative Distribution Function (CDF) in Python

Q: How is the CDF calculated?

The cumulative distribution function (CDF) of a random variable X is defined as F_X(x) = P(X ≤ x) for all x ∈ R. The subscript X indicates that this is the CDF of the random variable X.
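The definition can be turned directly into an estimate from samples: the fraction of samples that do not exceed x. The following is only a minimal sketch; the name `empirical_cdf`, the seed, and the choice of standard normal samples are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)           # seeded only so the sketch is repeatable
samples = rng.standard_normal(100_000)   # hypothetical draws of X

def empirical_cdf(x):
    """Estimate F_X(x) = P(X <= x) as the fraction of samples not exceeding x."""
    return np.mean(samples <= x)

print(empirical_cdf(0.0))   # close to 0.5 for a standard normal
```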

2 Answers

(I may have misinterpreted the question. If the question is how to get from a discrete PDF to a discrete CDF, then np.cumsum divided by a suitable constant will do if the samples are equispaced. If the samples are not equispaced, then np.cumsum of the array multiplied elementwise by the distances between the points will do.)
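A rough sketch of both cases, assuming a PDF sampled on a grid. The grids, the standard normal shape, and the variable names are made up for illustration, and the non-equispaced version is a simple Riemann sum, so it is only approximate.

```python
import numpy as np

# Hypothetical discrete PDF sampled on an equispaced grid (standard normal shape).
x = np.linspace(-5.0, 5.0, 1001)
pdf = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

# Equispaced case: cumulative sum, divided by its last value so the CDF ends at 1.
cdf_equi = np.cumsum(pdf)
cdf_equi /= cdf_equi[-1]

# Non-equispaced case: weight each PDF value by the distance to the previous point.
x_uneven = np.sort(np.random.default_rng(0).uniform(-5.0, 5.0, 1001))
pdf_uneven = np.exp(-0.5 * x_uneven**2) / np.sqrt(2.0 * np.pi)
widths = np.diff(x_uneven, prepend=x_uneven[0])   # the first width is 0
cdf_uneven = np.cumsum(pdf_uneven * widths)
```

In the non-equispaced case the last value of `cdf_uneven` is only approximately 1; dividing by it afterwards is one way to force an exact endpoint.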

If you have a discrete array of samples and you would like to know the CDF of the sample, you can just sort the array. If you look at the sorted result, you'll see that the smallest value represents 0% and the largest value represents 100%. If you want to know the value at 50% of the distribution, just look at the element in the middle of the sorted array.

Let us have a closer look at this with a simple example:

    import matplotlib.pyplot as plt
    import numpy as np

    # create some randomly distributed data:
    data = np.random.randn(10000)

    # sort the data:
    data_sorted = np.sort(data)

    # calculate the proportional values of samples
    p = 1. * np.arange(len(data)) / (len(data) - 1)

    # plot the sorted data:
    fig = plt.figure()
    ax1 = fig.add_subplot(121)
    ax1.plot(p, data_sorted)
    ax1.set_xlabel('$p$')
    ax1.set_ylabel('$x$')

    ax2 = fig.add_subplot(122)
    ax2.plot(data_sorted, p)
    ax2.set_xlabel('$x$')
    ax2.set_ylabel('$p$')

    plt.show()

This gives the following plot, where the right-hand plot is the traditional cumulative distribution function. It approximates the CDF of the process behind the points, but it naturally remains only an approximation as long as the number of points is finite.

[Figure: cumulative distribution function]

This function is easy to invert, and it depends on your application which form you need.
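As a rough sketch of both directions, np.interp can be used on the sorted data and the proportions `p`. The helper names `ecdf` and `quantile` are made up here, and linear interpolation between the sample points is an assumption, not the only possible choice.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded only so the sketch is repeatable
data_sorted = np.sort(rng.standard_normal(10000))
p = np.arange(len(data_sorted)) / (len(data_sorted) - 1)

def ecdf(x):
    """Forward direction: estimated P(X <= x), linearly interpolated."""
    return np.interp(x, data_sorted, p)

def quantile(q):
    """Inverse direction: the value x at which the estimated CDF equals q."""
    return np.interp(q, p, data_sorted)

median = quantile(0.5)
print(median, ecdf(median))
```

Outside the range of the data, np.interp clamps to the end values, so `ecdf` returns 0 below the smallest sample and 1 above the largest.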

answered Sep 24 '22 14:09

DrV

Assuming you know how your data is distributed (i.e. you know the PDF of your data), scipy can evaluate the corresponding CDF at your discrete sample values:

    import numpy as np
    import scipy.stats  # import the submodule explicitly so scipy.stats is available
    import matplotlib.pyplot as plt
    import seaborn as sns

    x = np.random.randn(10000)  # generate samples from normal distribution (discrete data)
    norm_cdf = scipy.stats.norm.cdf(x)  # calculate the cdf - also discrete

    # plot the cdf
    sns.lineplot(x=x, y=norm_cdf)
    plt.show()


We can even print the first few values of the cdf to show they are discrete:

    print(norm_cdf[:10])
    >>> array([0.39216484, 0.09554546, 0.71268696, 0.5007396 , 0.76484329,
           0.37920836, 0.86010018, 0.9191937 , 0.46374527, 0.4576634 ])

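The same call also accepts multi-dimensional arrays. Note that scipy.stats.norm.cdf is applied elementwise, so for the 2d data below it returns the marginal CDF value of each coordinate (each marginal here is standard normal), not the joint CDF of the pair: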

    mu = np.zeros(2)  # mean vector
    cov = np.array([[1, 0.6], [0.6, 1]])  # covariance matrix
    # generate 2d normally distributed samples using 0 mean and the covariance matrix above
    x = np.random.multivariate_normal(mean=mu, cov=cov, size=1000)  # 1000 samples
    norm_cdf = scipy.stats.norm.cdf(x)
    print(norm_cdf.shape)
    >>> (1000, 2)
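If what you need is the joint probability P(X1 ≤ x1, X2 ≤ x2) rather than one value per coordinate, scipy.stats.multivariate_normal provides a cdf method. The following is a minimal sketch using the same mean and covariance; the evaluation point (0, 0) is chosen only for illustration.

```python
import numpy as np
from scipy import stats

mu = np.zeros(2)                           # mean vector
cov = np.array([[1.0, 0.6], [0.6, 1.0]])   # covariance matrix
point = np.array([0.0, 0.0])               # hypothetical evaluation point

# Elementwise univariate CDF: one marginal probability per coordinate.
marginals = stats.norm.cdf(point)          # P(X1 <= 0) and P(X2 <= 0)

# Joint CDF: a single probability P(X1 <= 0, X2 <= 0).
joint = stats.multivariate_normal(mean=mu, cov=cov).cdf(point)

print(marginals, joint)
```

Because the two coordinates are positively correlated, the joint value is larger than the product of the two marginals.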

In the above examples, I had prior knowledge that my data was normally distributed, which is why I used scipy.stats.norm; scipy supports many other distributions as well. But again, you need to know how your data is distributed beforehand to use such functions. If you don't know how your data is distributed and you just pick an arbitrary distribution to calculate the cdf, you will most likely get incorrect results.
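One way to see how wrong an assumed distribution can be is to compare its CDF with the empirical CDF of the data. This is only a sketch: the exponential data is a made-up stand-in for data that is not normal, and the p-value from kstest is only indicative when the parameters were fitted to the same sample.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.exponential(scale=1.0, size=2000)   # hypothetical, clearly non-normal data

# Empirical CDF: needs no assumption about the distribution.
data_sorted = np.sort(data)
ecdf = np.arange(1, len(data) + 1) / len(data)

# CDF of a normal distribution whose parameters were fitted to the same data.
loc, scale = stats.norm.fit(data)
fitted = stats.norm.cdf(data_sorted, loc=loc, scale=scale)

# Largest vertical gap between the two curves at the sample points.
max_gap = np.max(np.abs(ecdf - fitted))

# Kolmogorov-Smirnov test of the data against the fitted normal.
result = stats.kstest(data, "norm", args=(loc, scale))
print(max_gap, result.pvalue)
```

A large gap and a tiny p-value suggest that the assumed distribution does not describe the data, in which case the empirical CDF is the safer choice.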

answered Sep 23 '22 14:09

PyRsquared



Tags: python, machine-learning, numpy, statistics, scipy

wizbcn
