
cumulative distribution plots python

I am doing a project in Python where I have two arrays of data. Let's call them pc and pnc. I need to plot a cumulative distribution of both of them on the same graph. For pc it should be a "less than" plot, i.e. at (x, y), y points in pc have a value less than x. For pnc it should be a "more than" plot, i.e. at (x, y), y points in pnc have a value more than x.

I have tried using the histogram function pyplot.hist. Is there a better and easier way to do what I want? Also, the x-axis has to be on a logarithmic scale.

akhiljain asked Mar 14 '13 11:03



2 Answers

You were close. Use numpy.histogram instead of plt.hist: it returns both the counts and the bin edges, and then you can plot the cumulative sum with ease:

    import numpy as np
    import matplotlib.pyplot as plt

    # some fake data
    data = np.random.randn(1000)
    # evaluate the histogram: counts per bin and the bin edges
    values, base = np.histogram(data, bins=40)
    # evaluate the cumulative
    cumulative = np.cumsum(values)
    # plot the cumulative function against the right bin edges,
    # since each cumulative count includes the whole bin
    plt.plot(base[1:], cumulative, c='blue')
    # plot the survival function
    plt.plot(base[1:], len(data) - cumulative, c='green')

    plt.show()
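The question also asks for a logarithmic x-axis, which the snippet above does not cover. Here is a minimal sketch of one way to do it, assuming the data is strictly positive: bin log10(data) with equal-width bins (which amounts to log-spaced bins) and switch the axis to a log scale. The helper name `binned_cumulative` and the lognormal fake data are only illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt


def binned_cumulative(data, n_bins=40):
    """Return right bin edges and cumulative counts on log-spaced bins.

    Assumes strictly positive data (needed for the logarithm).
    """
    data = np.asarray(data, dtype=float)
    # Equal-width bins in log10(data) are log-spaced bins in data.
    counts, log_edges = np.histogram(np.log10(data), bins=n_bins)
    edges = 10.0 ** log_edges
    # Each cumulative count covers everything up to the bin's right edge.
    return edges[1:], np.cumsum(counts)


rng = np.random.default_rng(0)
data = rng.lognormal(size=1000)  # fake, strictly positive data

right_edges, cumulative = binned_cumulative(data)

fig, ax = plt.subplots()
ax.plot(right_edges, cumulative, c="blue")               # counts up to x
ax.plot(right_edges, data.size - cumulative, c="green")  # counts above x
ax.set_xscale("log")
# Call plt.show() or fig.savefig(...) to see the result.
```

With linearly spaced bins the same plot would cram most of its bins into the right-hand part of a log axis, which is why the binning is done in log space here.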

EnricoGiampieri answered Oct 16 '22 04:10



Using histograms is unnecessarily heavy and imprecise (the binning makes the data fuzzy). You can simply sort the values: the index of each value in the sorted array is the number of values that are smaller. This shorter and simpler solution looks like this:

    import numpy as np
    import matplotlib.pyplot as plt

    # Some fake data:
    data = np.random.randn(1000)

    sorted_data = np.sort(data)  # Or data.sort(), if data can be modified

    # Cumulative counts:
    plt.step(sorted_data, np.arange(sorted_data.size))  # From 0 to the number of data points-1
    plt.step(sorted_data[::-1], np.arange(sorted_data.size))  # From the number of data points-1 to 0

    plt.show()

Furthermore, a more appropriate plot style is indeed plt.step() instead of plt.plot(), since the data is in discrete locations.


The resulting curve is more ragged than the output of EnricoGiampieri's answer, but it is the exact cumulative count (instead of an approximate, fuzzier, binned version of it).
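One caveat: the "index equals the number of smaller values" shortcut is only exact when no value is repeated. If ties are possible, a sketch like the following gives tie-safe counts at arbitrary x positions using np.searchsorted. The helper names `count_less_than` and `count_more_than` and the small array are made up for illustration.

```python
import numpy as np


def count_less_than(data, x):
    """Number of entries in data strictly smaller than each x."""
    sorted_data = np.sort(data)
    # side="left" returns the index of the first entry >= x.
    return np.searchsorted(sorted_data, x, side="left")


def count_more_than(data, x):
    """Number of entries in data strictly larger than each x."""
    sorted_data = np.sort(data)
    # side="right" returns the index just past the last entry <= x.
    return sorted_data.size - np.searchsorted(sorted_data, x, side="right")


data = np.array([3.0, 1.0, 2.0, 2.0, 5.0])  # note the repeated 2.0
x = np.array([0.5, 2.0, 2.5, 5.0, 6.0])

less = count_less_than(data, x)  # [0, 1, 3, 4, 5]
more = count_more_than(data, x)  # [5, 2, 2, 0, 0]
```

For data without repeated values, `count_less_than(data, np.sort(data))` is the same as `np.arange(data.size)`, which is what the step plot above uses.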

PS: As SebastianRaschka noted, the very last point should ideally show the total count (instead of the total count-1). This can be achieved with:

    plt.step(np.concatenate([sorted_data, sorted_data[[-1]]]),
             np.arange(sorted_data.size+1))
    plt.step(np.concatenate([sorted_data[::-1], sorted_data[[0]]]),
             np.arange(sorted_data.size+1))

There are so many points in data that the effect is not visible without a zoom, but the very last point at the total count does matter when the data contains only a few points.
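Putting this together for the question's two arrays, here is a rough sketch with a "less than" curve for pc, a "more than" curve for pnc, and a logarithmic x-axis. The pc and pnc arrays below are fake lognormal placeholders, and a log axis assumes the real data is strictly positive.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
pc = rng.lognormal(mean=0.0, size=200)   # placeholder for the real pc
pnc = rng.lognormal(mean=1.0, size=300)  # placeholder for the real pnc

pc_sorted = np.sort(pc)
pnc_sorted = np.sort(pnc)

# "Less than" curve for pc: ascending x, counts 0..len(pc).
# The last x is repeated so the curve reaches the total count.
pc_x = np.concatenate([pc_sorted, pc_sorted[[-1]]])
pc_y = np.arange(pc_sorted.size + 1)

# "More than" curve for pnc: descending x, counts 0..len(pnc).
pnc_x = np.concatenate([pnc_sorted[::-1], pnc_sorted[[0]]])
pnc_y = np.arange(pnc_sorted.size + 1)

fig, ax = plt.subplots()
ax.step(pc_x, pc_y, label="pc: points with value less than x")
ax.step(pnc_x, pnc_y, label="pnc: points with value more than x")
ax.set_xscale("log")  # logarithmic x-axis, as the question requires
ax.set_xlabel("x")
ax.set_ylabel("number of points")
ax.legend()
# Call plt.show() or fig.savefig(...) to see the result.
```

If the real arrays contain zeros or negative values, those points cannot be drawn on a log axis and would need to be filtered or shifted first.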

Eric O Lebigot answered Oct 16 '22 03:10
