Python Empirical distribution function (ecdf) implementation

Tags:

I am aware of statsmodels.tools.tools.ECDF but since the calculation of an empricial cumulative distribution function (ECDF) is pretty straight-forward and I want to minimise dependencies in my project, I want to code it manually.

In a given list() / np.array() Pandas.Series, the ECDF for each element can be calculated as given in Wikipedia:

enter image description here

I have the Pandas DataFrame ,dfser, below and I want to get the ecdf of the values column. My two one-liner solutions are given as well.

Is there a faster way to do this? Speed is important in my application.

# Note that in my case indices are unique identifiers so I cannot reset them.
import numpy as np
import pandas as pd

# all indices are unique, but there may be duplicate measurement values (that belong to different indices). 
dfser = pd.DataFrame({'group':['a','b','b','a','d','c','e','e','c','a','b','d','d','c','d','e','e','a'],
                      'values':[2.01899E-06, 1.12186E-07, 8.97467E-07, 2.91257E-06, 1.93733E-05, 
                                0.00017889, 0.000120963, 4.27643E-07, 3.33614E-07, 2.08352E-12,  
                                1.39478E-05, 4.28255E-08, 9.7619E-06, 8.51787E-09, 1.28344E-09, 
                                3.5063E-05, 0.01732035,2.08352E-12]},
                       index = [123, 532, 235, 645, 747, 856, 345, 245, 845, 248, 901, 712, 162, 126, 
                              198,748, 127,395]      )

# My 1st Solution - list comprehension
dfser['ecdf']=[sum( dfser['values'] <= x)/float(dfser['values'].size) for x in dfser['values']]

# My 2nd Solution - ranking
dfser['rank'] = dfser['values'].rank(ascending = 0)
dfser['ecdf_r']=(len(dfser)-dfser['rank']+1)/len(dfser)
dfser
    group        values      ecdf  rank    ecdf_r
123     a  2.018990e-06  0.555556   9.0  0.555556
532     b  1.121860e-07  0.333333  13.0  0.333333
235     b  8.974670e-07  0.500000  10.0  0.500000
645     a  2.912570e-06  0.611111   8.0  0.611111
747     d  1.937330e-05  0.777778   5.0  0.777778
856     c  1.788900e-04  0.944444   2.0  0.944444
345     e  1.209630e-04  0.888889   3.0  0.888889
245     e  4.276430e-07  0.444444  11.0  0.444444
845     c  3.336140e-07  0.388889  12.0  0.388889
248     a  2.083520e-12  0.111111  17.5  0.083333
901     b  1.394780e-05  0.722222   6.0  0.722222
712     d  4.282550e-08  0.277778  14.0  0.277778
162     d  9.761900e-06  0.666667   7.0  0.666667
126     c  8.517870e-09  0.222222  15.0  0.222222
198     d  1.283440e-09  0.166667  16.0  0.166667
748     e  3.506300e-05  0.833333   4.0  0.833333
127     e  1.732035e-02  1.000000   1.0  1.000000
395     a  2.083520e-12  0.111111  17.5  0.083333

842

asked Feb 19 '14 17:02

Zhubarb

Video Answer

1 Answers

Since you are already using pandas I think it will be silly not to use some of its features:

In [15]:
import numpy as np
from numpy import *
sq=ser.value_counts()
sq.sort_index().cumsum()*1./len(sq)
Out[15]:
2.083520e-12    0.058824
1.283440e-09    0.117647
8.517870e-09    0.176471
4.282550e-08    0.235294
1.121860e-07    0.294118
3.336140e-07    0.352941
4.276430e-07    0.411765
8.974670e-07    0.470588
2.018990e-06    0.529412
2.912570e-06    0.588235
9.761900e-06    0.647059
1.394780e-05    0.705882
1.937330e-05    0.764706
3.506300e-05    0.823529
1.209630e-04    0.882353
1.788900e-04    0.941176
1.732035e-02    1.000000
dtype: float64

And speed comparison

In [19]:

%timeit sq.sort_index().cumsum()*1./len(sq)
1000 loops, best of 3: 344 µs per loop
In [18]:

%timeit ser.value_counts().sort_index().cumsum()*1./len(ser.value_counts())
1000 loops, best of 3: 1.58 ms per loop
In [17]:

%timeit [sum( ser <= x)/float(len(ser)) for x in ser]
100 loops, best of 3: 3.31 ms per loop

If the values are all unique, the ser.value_counts() is no longer needed. That part is slow (Fetching unique values). All you need in that case is just to sort ser.

In [23]:

%timeit np.arange(1, ser.size+1)/float(ser.size)
10000 loops, best of 3: 11.6 µs per loop

The fastest version that I can think of is to use get vectorized:

In [35]:

np.sum(dfser['values'].values[...,newaxis]<=dfser['values'].values.reshape((1,-1)), axis=0)*1./dfser['values'].size
Out[35]:
array([ 0.55555556,  0.33333333,  0.5       ,  0.61111111,  0.77777778,
        0.94444444,  0.88888889,  0.44444444,  0.38888889,  0.11111111,
        0.72222222,  0.27777778,  0.66666667,  0.22222222,  0.16666667,
        0.83333333,  1.        ,  0.11111111])

Add let see:

In [37]:

%timeit dfser['ecdf']=[sum( dfser['values'] <= x)/float(dfser['values'].size) for x in dfser['values']]
100 loops, best of 3: 6 ms per loop
In [38]:

%%timeit
dfser['rank'] = dfser['values'].rank(ascending = 0)
dfser['ecdf_r']=(len(dfser)-dfser['rank']+1)/len(dfser)
1000 loops, best of 3: 827 µs per loop
In [39]:

%timeit np.sum(dfser['values'].values[...,newaxis]<=dfser['values'].values.reshape((1,-1)), axis=0)*1./dfser['values'].size
10000 loops, best of 3: 91.1 µs per loop

164

answered Sep 23 '22 13:09

CT Zhu

Related questions
                            
                                grequests pool with multiple request.session?
                            
                                pandas much slower than numpy?
                            
                                Building an egg of my python project
                            
                                pyodbc does not throw on SQL Server error
                            
                                How to indicate that a function expects a function as a parameter, or returns a function, via function annotations?
                            
                                Object level permissions in Django admin
                            
                                python optparse, default values for optional options
                            
                                Pandas - Groupby and create new DataFrame?
                            
                                Can't reach Eve REST API
                            
                                Pandas Boolean Operation in a Python List
                            
                                What do I have to install to resolve Could not find any typelib for GtkSource, Cannot import: GtkSourceView, cannot import name GtkSource
                            
                                Paramiko: how to ensure data is received between commands
                            
                                List (or iterator) of tuples returned by MAP (PySpark)
                            
                                'instancemethod' object has no attribute '__getitem__' with class variables
                            
                                subtuples for a tuple
                            
                                error: command 'cc' failed with exit status 1 - MySQLdb installation on MAC
                            
                                Python3, best alternatives to eval()? (input as an expression)
                            
                                Scrolling down a page with Selenium Webdriver
                            
                                Docopt: options after repeating elements are interpeted as repeating elements
                            
                                How to write a DownloadHandler for scrapy that makes requests through socksipy?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Python Empirical distribution function (ecdf) implementation

Tags:

python

pandas

numpy

ecdf

Zhubarb

People also ask

Video Answer

1 Answers

CT Zhu

Recent Activity

Donate For Us