Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Summing over pair of indices (or more) in Python

Tags:

python

numpy

sum

One way to calculate the Gini coefficient of a sample is using the relative mean difference (RMD) which is 2 times the Gini coefficient. RMD depends on the mean difference which is given by:

enter image description here

So I need to calculate each difference between pair of elements in a sample (yi - yj). It took me quite a bit to figure out a way to do it but I want to know if there is a function that does this for you.

At first I tried this but I bet it's very slow in big data sets (by the way, s is the sample):

In [124]:

%%timeit
from itertools import permutations
k = 0
for i, j in list(permutations(s,2)):
    k += abs(i-j)
MD = k/float(len(s)**2)
G = MD / float(mean(s))
G = G/2
G
10000 loops, best of 3: 78 us per loop

Then I tried the following which is less understandable but quicker:

In [126]:
%%timeit
m = abs(s - s.reshape(len(s), 1))
MD = np.sum(m)/float((len(s)**2))
G = MD / float(mean(s))
G = G/2
G
10000 loops, best of 3: 46.8 us per loop

Is there something efficient but easy to generalize? For example, what if I want to sum over three indices?

This is the sample I was using:

sample = array([5487574374,     686306,    5092789,   17264231,   41733014,
         60870152,   82204091,  227787612,  264942911,  716909668,
        679759369, 1336605253,  788028471,  331434695,  146295398,
         88673463,  224589748,  128576176,  346121028])

gini(sample)
Out[155]:
0.2692307692307692

Thanks!

like image 611
Robert Smith Avatar asked Dec 27 '12 00:12

Robert Smith


1 Answers

For the MD example you give, it can be exploited by sorting, You can achieve O(N*Log(N)) instead of O(N^2)

y = [2,3,2,34]

def slow(y):
    tot = 0
    for i in range(len(y)):
        for j in range(len(y)):
            if i != j:
                tot += abs(y[i] - y[j])
    return float(tot)/len(y)**2

print slow(y)

def fast(y):
    sorted_y = sorted(y)
    tot = 0
    for i, yi in enumerate(sorted_y):
        smaller = i
        bigger = len(y) - i - 1
        tot += smaller * yi - bigger * yi
    return float(2*tot)/len(y)**2

print fast(y)

Often you will have to use dynamic programming or other techniques to make these faster, I'm not sure if there is a "one method fits all" solution.

like image 124
robert king Avatar answered Oct 11 '22 11:10

robert king