Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python Empirical distribution function (ecdf) implementation

I am aware of statsmodels.tools.tools.ECDF but since the calculation of an empricial cumulative distribution function (ECDF) is pretty straight-forward and I want to minimise dependencies in my project, I want to code it manually.

In a given list() / np.array() Pandas.Series, the ECDF for each element can be calculated as given in Wikipedia:

enter image description here

I have the Pandas DataFrame ,dfser, below and I want to get the ecdf of the values column. My two one-liner solutions are given as well.

Is there a faster way to do this? Speed is important in my application.

# Note that in my case indices are unique identifiers so I cannot reset them.
import numpy as np
import pandas as pd

# all indices are unique, but there may be duplicate measurement values (that belong to different indices). 
dfser = pd.DataFrame({'group':['a','b','b','a','d','c','e','e','c','a','b','d','d','c','d','e','e','a'],
                      'values':[2.01899E-06, 1.12186E-07, 8.97467E-07, 2.91257E-06, 1.93733E-05, 
                                0.00017889, 0.000120963, 4.27643E-07, 3.33614E-07, 2.08352E-12,  
                                1.39478E-05, 4.28255E-08, 9.7619E-06, 8.51787E-09, 1.28344E-09, 
                                3.5063E-05, 0.01732035,2.08352E-12]},
                       index = [123, 532, 235, 645, 747, 856, 345, 245, 845, 248, 901, 712, 162, 126, 
                              198,748, 127,395]      )

# My 1st Solution - list comprehension
dfser['ecdf']=[sum( dfser['values'] <= x)/float(dfser['values'].size) for x in dfser['values']]

# My 2nd Solution - ranking
dfser['rank'] = dfser['values'].rank(ascending = 0)
dfser['ecdf_r']=(len(dfser)-dfser['rank']+1)/len(dfser)
dfser
    group        values      ecdf  rank    ecdf_r
123     a  2.018990e-06  0.555556   9.0  0.555556
532     b  1.121860e-07  0.333333  13.0  0.333333
235     b  8.974670e-07  0.500000  10.0  0.500000
645     a  2.912570e-06  0.611111   8.0  0.611111
747     d  1.937330e-05  0.777778   5.0  0.777778
856     c  1.788900e-04  0.944444   2.0  0.944444
345     e  1.209630e-04  0.888889   3.0  0.888889
245     e  4.276430e-07  0.444444  11.0  0.444444
845     c  3.336140e-07  0.388889  12.0  0.388889
248     a  2.083520e-12  0.111111  17.5  0.083333
901     b  1.394780e-05  0.722222   6.0  0.722222
712     d  4.282550e-08  0.277778  14.0  0.277778
162     d  9.761900e-06  0.666667   7.0  0.666667
126     c  8.517870e-09  0.222222  15.0  0.222222
198     d  1.283440e-09  0.166667  16.0  0.166667
748     e  3.506300e-05  0.833333   4.0  0.833333
127     e  1.732035e-02  1.000000   1.0  1.000000
395     a  2.083520e-12  0.111111  17.5  0.083333
like image 842
Zhubarb Avatar asked Feb 19 '14 17:02

Zhubarb


People also ask

Can I use an empirical distribution function in Python?

It is a good case for using an empirical distribution function. An empirical distribution function can be fit for a data sample in Python. The statmodels Python library provides the ECDF class for fitting an empirical cumulative distribution function and calculating the cumulative probabilities for specific observations from the domain.

What is ECDF in Statmodels Python?

The statmodels Python library provides the ECDF class for fitting an empirical cumulative distribution function and calculating the cumulative probabilities for specific observations from the domain. The distribution is fit by calling ECDF () and passing in the raw data sample. ...

Why is ECDF a cumulative distribution function?

It is cumulative distribution function because it gives us the probability that variable will take a value less than or equal to specific value of the variable. In an ECDF, x-axis correspond to the range of values for variables and on the y-axis we plot the proportion of data points that are less than are equal to corresponding x-axis value.

Is there an alternative for ECDF () function of R in Python?

Then the empirical distribution function is defined as: Source Coming to my point, it is really hard to find an alternative for ecdf () function of R in Python. There are few online codes available, but this is verified as the best possible match to the R's ecdf () function.


Video Answer


1 Answers

Since you are already using pandas I think it will be silly not to use some of its features:

In [15]:
import numpy as np
from numpy import *
sq=ser.value_counts()
sq.sort_index().cumsum()*1./len(sq)
Out[15]:
2.083520e-12    0.058824
1.283440e-09    0.117647
8.517870e-09    0.176471
4.282550e-08    0.235294
1.121860e-07    0.294118
3.336140e-07    0.352941
4.276430e-07    0.411765
8.974670e-07    0.470588
2.018990e-06    0.529412
2.912570e-06    0.588235
9.761900e-06    0.647059
1.394780e-05    0.705882
1.937330e-05    0.764706
3.506300e-05    0.823529
1.209630e-04    0.882353
1.788900e-04    0.941176
1.732035e-02    1.000000
dtype: float64

And speed comparison

In [19]:

%timeit sq.sort_index().cumsum()*1./len(sq)
1000 loops, best of 3: 344 µs per loop
In [18]:

%timeit ser.value_counts().sort_index().cumsum()*1./len(ser.value_counts())
1000 loops, best of 3: 1.58 ms per loop
In [17]:

%timeit [sum( ser <= x)/float(len(ser)) for x in ser]
100 loops, best of 3: 3.31 ms per loop

If the values are all unique, the ser.value_counts() is no longer needed. That part is slow (Fetching unique values). All you need in that case is just to sort ser.

In [23]:

%timeit np.arange(1, ser.size+1)/float(ser.size)
10000 loops, best of 3: 11.6 µs per loop

The fastest version that I can think of is to use get vectorized:

In [35]:

np.sum(dfser['values'].values[...,newaxis]<=dfser['values'].values.reshape((1,-1)), axis=0)*1./dfser['values'].size
Out[35]:
array([ 0.55555556,  0.33333333,  0.5       ,  0.61111111,  0.77777778,
        0.94444444,  0.88888889,  0.44444444,  0.38888889,  0.11111111,
        0.72222222,  0.27777778,  0.66666667,  0.22222222,  0.16666667,
        0.83333333,  1.        ,  0.11111111])

Add let see:

In [37]:

%timeit dfser['ecdf']=[sum( dfser['values'] <= x)/float(dfser['values'].size) for x in dfser['values']]
100 loops, best of 3: 6 ms per loop
In [38]:

%%timeit
dfser['rank'] = dfser['values'].rank(ascending = 0)
dfser['ecdf_r']=(len(dfser)-dfser['rank']+1)/len(dfser)
1000 loops, best of 3: 827 µs per loop
In [39]:

%timeit np.sum(dfser['values'].values[...,newaxis]<=dfser['values'].values.reshape((1,-1)), axis=0)*1./dfser['values'].size
10000 loops, best of 3: 91.1 µs per loop
like image 164
CT Zhu Avatar answered Sep 23 '22 13:09

CT Zhu