Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Gini Coefficient in Julia: Efficient and Accurate Code

I'm trying to implement the following formula in Julia for calculating the Gini coefficient of a wage distribution:

enter image description here

where enter image description here

Here's a simplified version of the code I'm using for this:

# Takes a array where first column is value of wages
# (y_i in formula), and second column is probability
# of wage value (f(y_i) in formula).
function gini(wagedistarray)
    # First calculate S values in formula
    for i in 1:length(wagedistarray[:,1])
        for j in 1:i
            Swages[i]+=wagedistarray[j,2]*wagedistarray[j,1]
        end
    end

    # Now calculate value to subtract from 1 in gini formula
    Gwages = Swages[1]*wagedistarray[1,2]
    for i in 2:length(Swages)
        Gwages += wagedistarray[i,2]*(Swages[i]+Swages[i-1])
    end

    # Final step of gini calculation
    return giniwages=1-(Gwages/Swages[length(Swages)])          
end

wagedistarray=zeros(10000,2)                                 
Swages=zeros(length(wagedistarray[:,1]))                    

for i in 1:length(wagedistarray[:,1])
   wagedistarray[i,1]=1
   wagedistarray[i,2]=1/10000
end


@time result=gini(wagedistarray)

It gives a value of near zero, which is what you expect for a completely equal wage distribution. However, it takes quite a long time: 6.796 secs.

Any ideas for improvement?

like image 424
David Zentler-Munro Avatar asked Jul 09 '15 15:07

David Zentler-Munro


1 Answers

Try this:

function gini(wagedistarray)
    nrows = size(wagedistarray,1)
    Swages = zeros(nrows)
    for i in 1:nrows
        for j in 1:i
            Swages[i] += wagedistarray[j,2]*wagedistarray[j,1]
        end
    end

    Gwages=Swages[1]*wagedistarray[1,2]
    for i in 2:nrows
        Gwages+=wagedistarray[i,2]*(Swages[i]+Swages[i-1])
    end

    return 1-(Gwages/Swages[length(Swages)])

end

wagedistarray=zeros(10000,2)
for i in 1:size(wagedistarray,1)
   wagedistarray[i,1]=1
   wagedistarray[i,2]=1/10000
end

@time result=gini(wagedistarray)
  • Time before: 5.913907256 seconds (4000481676 bytes allocated, 25.37% gc time)
  • Time after: 0.134799301 seconds (507260 bytes allocated)
  • Time after (second run): elapsed time: 0.123665107 seconds (80112 bytes allocated)

The primary problems are that Swages was a global variable (wasn't living in the function) which is not a good coding practice, but more importantly is a performance killer. The other thing I noticed was length(wagedistarray[:,1]), which makes a copy of that column and then asks its length - that was generating some extra "garbage". The second run is faster because there is some compilation time the very first time the function is run.

You crank performance even higher using @inbounds, i.e.

function gini(wagedistarray)
    nrows = size(wagedistarray,1)
    Swages = zeros(nrows)
    @inbounds for i in 1:nrows
        for j in 1:i
            Swages[i] += wagedistarray[j,2]*wagedistarray[j,1]
        end
    end

    Gwages=Swages[1]*wagedistarray[1,2]
    @inbounds for i in 2:nrows
        Gwages+=wagedistarray[i,2]*(Swages[i]+Swages[i-1])
    end

    return 1-(Gwages/Swages[length(Swages)])
end

which gives me elapsed time: 0.042070662 seconds (80112 bytes allocated)

Finally, check out this version, which is actually faster than all and is also the most accurate I think:

function gini2(wagedistarray)
    Swages = cumsum(wagedistarray[:,1].*wagedistarray[:,2])
    Gwages = Swages[1]*wagedistarray[1,2] +
                sum(wagedistarray[2:end,2] .* 
                        (Swages[2:end]+Swages[1:end-1]))
    return 1 - Gwages/Swages[end]
end

Which has elapsed time: 0.00041119 seconds (721664 bytes allocated). The main benefit was changing from a O(n^2) double for loop to the O(n) cumsum.

like image 61
IainDunning Avatar answered Sep 20 '22 18:09

IainDunning