Is there a way to do this? I cannot seem to find an easy way to interface a pandas Series with plotting a CDF.


Plotting CDF of a pandas series in python

2 Answers

I believe the functionality you're looking for is in the hist method of a Series object, which wraps the hist() function in matplotlib.

Here's the relevant documentation:

    In [10]: import matplotlib.pyplot as plt

    In [11]: plt.hist?
    ...
    Plot a histogram.

    Compute and draw the histogram of *x*. The return value is a
    tuple (*n*, *bins*, *patches*) or ([*n0*, *n1*, ...], *bins*,
    [*patches0*, *patches1*,...]) if the input contains multiple
    data.
    ...
    cumulative : boolean, optional, default : False
        If `True`, then a histogram is computed where each bin gives the
        counts in that bin plus all bins for smaller values. The last bin
        gives the total number of datapoints.  If `normed` is also `True`
        then the histogram is normalized such that the last bin equals 1.
        If `cumulative` evaluates to less than 0 (e.g., -1), the direction
        of accumulation is reversed.  In this case, if `normed` is also
        `True`, then the histogram is normalized such that the first bin
        equals 1.

    ...

For example (recent matplotlib versions use `density` in place of the `normed` argument quoted above):

    In [12]: import pandas as pd

    In [13]: import numpy as np

    In [14]: ser = pd.Series(np.random.normal(size=1000))

    In [15]: ser.hist(cumulative=True, density=1, bins=100)
    Out[15]: <matplotlib.axes.AxesSubplot at 0x11469a590>

    In [16]: plt.show()
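If the numbers behind that cumulative histogram are also wanted, they can be reproduced without plotting. The following is a minimal sketch, assuming NumPy's `np.histogram` with the same 100 bins; the fixed seed and the variable names are illustrative and not part of the answer above.

```python
import numpy as np
import pandas as pd

# Illustrative sample; the fixed seed is an assumption made for reproducibility.
rng = np.random.default_rng(0)
ser = pd.Series(rng.normal(size=1000))

# Count observations per bin, using 100 equal-width bins as in the hist() call.
counts, bin_edges = np.histogram(ser.to_numpy(), bins=100)

# Running total of the counts, normalized so that the last bin equals 1.
cum_fraction = np.cumsum(counts) / counts.sum()

print(cum_fraction[-1])  # 1.0
```

Each entry of `cum_fraction` is the share of observations at or below the right edge of the corresponding bin, so the result is a binned approximation of the CDF rather than the exact empirical one.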

answered Sep 28 '22 19:09

Dan Frank

In case you are also interested in the values, not just the plot, they can be computed as follows.

    import numpy as np  # needed for the np.random.normal examples further down
    import pandas as pd

    # If you are in Jupyter
    %matplotlib inline

This will always work (discrete and continuous distributions)

    # Define your series
    s = pd.Series([9, 5, 3, 5, 5, 4, 6, 5, 5, 8, 7], name = 'value')
    df = pd.DataFrame(s)

    # Get the frequency, PDF and CDF for each value in the series

    # Frequency
    stats_df = df \
    .groupby('value') \
    ['value'] \
    .agg('count') \
    .pipe(pd.DataFrame) \
    .rename(columns = {'value': 'frequency'})

    # PDF
    stats_df['pdf'] = stats_df['frequency'] / sum(stats_df['frequency'])

    # CDF
    stats_df['cdf'] = stats_df['pdf'].cumsum()
    stats_df = stats_df.reset_index()
    stats_df

[Image: https://i.stack.imgur.com/eXZa0.png]
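The same pdf and cdf columns can probably be reached more compactly. The following is a minimal sketch, assuming `Series.value_counts(normalize=True)` is an acceptable substitute for the groupby/count step; the `stats` name is illustrative and this is not the answer author's own method.

```python
import pandas as pd

s = pd.Series([9, 5, 3, 5, 5, 4, 6, 5, 5, 8, 7], name='value')

# Relative frequency of each distinct value, ordered by value.
pmf = s.value_counts(normalize=True).sort_index()

# Running sum of the relative frequencies gives the CDF at each value.
cdf = pmf.cumsum()

# Assemble a table with one row per distinct value.
stats = (
    pd.DataFrame({'pdf': pmf, 'cdf': cdf})
    .rename_axis('value')
    .reset_index()
)
print(stats)
```

For this series the value 5 occurs 5 times out of 11, so its pdf entry is 5/11 and its cdf entry is 7/11 (two smaller observations plus the five fives).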

    # Plot the discrete Probability Mass Function and CDF.
    # Technically, the 'pdf' label in the legend and the table should be 'pmf'
    # (Probability Mass Function) since the distribution is discrete.

    # If you don't have too many values / usually discrete case
    stats_df.plot.bar(x = 'value', y = ['pdf', 'cdf'], grid = True)

[Image: https://i.stack.imgur.com/4b5Oa.png]

Alternative example, for a sample drawn from a continuous distribution or one with a lot of distinct values:

    # Define your series
    s = pd.Series(np.random.normal(loc = 10, scale = 0.1, size = 1000), name = 'value')

# ... all the same calculation stuff to get the frequency, PDF, CDF

    # Plot
    stats_df.plot(x = 'value', y = ['pdf', 'cdf'], grid = True)

[Image: https://i.stack.imgur.com/l8qhF.png]
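Since the same calculation is applied to both samples, it may be convenient to wrap it in a function. The following is a minimal sketch; `cdf_table` is a hypothetical helper name, and the fixed seed is an assumption made only so the example is reproducible.

```python
import numpy as np
import pandas as pd


def cdf_table(s: pd.Series) -> pd.DataFrame:
    """Return one row per distinct value with its frequency, pdf and cdf.

    Hypothetical helper: it only repackages the groupby/count/cumsum steps.
    """
    df = s.to_frame(name='value')
    stats_df = (
        df.groupby('value')['value']
        .agg('count')                              # occurrences of each value
        .pipe(pd.DataFrame)
        .rename(columns={'value': 'frequency'})
    )
    stats_df['pdf'] = stats_df['frequency'] / stats_df['frequency'].sum()
    stats_df['cdf'] = stats_df['pdf'].cumsum()     # values are already sorted
    return stats_df.reset_index()


# Continuous sample, as in the example above (the seed is illustrative).
rng = np.random.default_rng(0)
s = pd.Series(rng.normal(loc=10, scale=0.1, size=1000), name='value')

stats_df = cdf_table(s)
print(stats_df.tail())
```

The groupby step returns the distinct values in ascending order, which is what makes the cumulative sum a valid CDF without an explicit sort.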

For continuous distributions only

Please note that if it is reasonable to assume that each value occurs only once in the sample (typically the case for continuous distributions), then the groupby() + agg('count') step is not necessary, since the count is always 1.

In this case, a percent rank can be used to get to the cdf directly.

Use your best judgment when taking this kind of shortcut! :)

    # Define your series
    s = pd.Series(np.random.normal(loc = 10, scale = 0.1, size = 1000), name = 'value')
    df = pd.DataFrame(s)

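    # Get to the CDF directly
    # (rank the 'value' column explicitly so the result is a Series,
    # which also keeps the cell safe to re-run once 'cdf' exists)
    df['cdf'] = df['value'].rank(method = 'average', pct = True)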

    # Sort and plot
    df.sort_values('value').plot(x = 'value', y = 'cdf', grid = True)

[Image: https://i.stack.imgur.com/qySHB.png]
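The caveat about repeated values can be checked numerically. The following is a minimal sketch: it compares the percent rank with an empirical CDF built by hand on a sample without ties, then shows what happens on the discrete series from the first example. The seed and variable names are illustrative, and the `method='max'` comparison is a suggestion that is not part of the answer above.

```python
import numpy as np
import pandas as pd

# Continuous sample with a fixed, illustrative seed: ties are not expected here.
rng = np.random.default_rng(0)
s = pd.Series(rng.normal(loc=10, scale=0.1, size=1000), name='value')

# Empirical CDF by hand: after sorting, the i-th smallest value gets i / n.
n = len(s)
ecdf_by_hand = np.arange(1, n + 1) / n

# The same quantity via percent rank, read off in sorted order.
ranked = s.rank(method='average', pct=True)
ecdf_by_rank = ranked.loc[s.sort_values().index].to_numpy()

no_ties_match = np.allclose(ecdf_by_hand, ecdf_by_rank)
print(no_ties_match)

# Discrete series with repeated values: 5 occurs five times out of 11.
d = pd.Series([9, 5, 3, 5, 5, 4, 6, 5, 5, 8, 7], name='value')

# 'average' assigns the tied fives the mean of ranks 3..7, i.e. 5 / 11.
avg_at_5 = d.rank(method='average', pct=True)[d == 5].iloc[0]

# 'max' assigns them the highest tied rank, i.e. 7 / 11, which is P(X <= 5).
max_at_5 = d.rank(method='max', pct=True)[d == 5].iloc[0]

print(avg_at_5, max_at_5)
```

With ties present, `method='average'` understates the CDF at the repeated value, whereas `method='max'` reproduces the 7/11 obtained from the frequency table.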

answered Sep 28 '22 20:09

Raphvanns


Tags: python, pandas, series, cdf

wolfsatthedoor
