I'm trying to implement the following formula in Julia for calculating the Gini coefficient of a wage distribution:
where
Here's a simplified version of the code I'm using for this:
# Takes a array where first column is value of wages
# (y_i in formula), and second column is probability
# of wage value (f(y_i) in formula).
function gini(wagedistarray)
# First calculate S values in formula
for i in 1:length(wagedistarray[:,1])
for j in 1:i
Swages[i]+=wagedistarray[j,2]*wagedistarray[j,1]
end
end
# Now calculate value to subtract from 1 in gini formula
Gwages = Swages[1]*wagedistarray[1,2]
for i in 2:length(Swages)
Gwages += wagedistarray[i,2]*(Swages[i]+Swages[i-1])
end
# Final step of gini calculation
return giniwages=1-(Gwages/Swages[length(Swages)])
end
wagedistarray=zeros(10000,2)
Swages=zeros(length(wagedistarray[:,1]))
for i in 1:length(wagedistarray[:,1])
wagedistarray[i,1]=1
wagedistarray[i,2]=1/10000
end
@time result=gini(wagedistarray)
It gives a value of near zero, which is what you expect for a completely equal wage distribution. However, it takes quite a long time: 6.796 secs.
Any ideas for improvement?
Try this:
function gini(wagedistarray)
nrows = size(wagedistarray,1)
Swages = zeros(nrows)
for i in 1:nrows
for j in 1:i
Swages[i] += wagedistarray[j,2]*wagedistarray[j,1]
end
end
Gwages=Swages[1]*wagedistarray[1,2]
for i in 2:nrows
Gwages+=wagedistarray[i,2]*(Swages[i]+Swages[i-1])
end
return 1-(Gwages/Swages[length(Swages)])
end
wagedistarray=zeros(10000,2)
for i in 1:size(wagedistarray,1)
wagedistarray[i,1]=1
wagedistarray[i,2]=1/10000
end
@time result=gini(wagedistarray)
5.913907256 seconds (4000481676 bytes allocated, 25.37% gc time)
0.134799301 seconds (507260 bytes allocated)
elapsed time: 0.123665107 seconds (80112 bytes allocated)
The primary problems are that Swages
was a global variable (wasn't living in the function) which is not a good coding practice, but more importantly is a performance killer. The other thing I noticed was length(wagedistarray[:,1])
, which makes a copy of that column and then asks its length - that was generating some extra "garbage". The second run is faster because there is some compilation time the very first time the function is run.
You crank performance even higher using @inbounds
, i.e.
function gini(wagedistarray)
nrows = size(wagedistarray,1)
Swages = zeros(nrows)
@inbounds for i in 1:nrows
for j in 1:i
Swages[i] += wagedistarray[j,2]*wagedistarray[j,1]
end
end
Gwages=Swages[1]*wagedistarray[1,2]
@inbounds for i in 2:nrows
Gwages+=wagedistarray[i,2]*(Swages[i]+Swages[i-1])
end
return 1-(Gwages/Swages[length(Swages)])
end
which gives me elapsed time: 0.042070662 seconds (80112 bytes allocated)
Finally, check out this version, which is actually faster than all and is also the most accurate I think:
function gini2(wagedistarray)
Swages = cumsum(wagedistarray[:,1].*wagedistarray[:,2])
Gwages = Swages[1]*wagedistarray[1,2] +
sum(wagedistarray[2:end,2] .*
(Swages[2:end]+Swages[1:end-1]))
return 1 - Gwages/Swages[end]
end
Which has elapsed time: 0.00041119 seconds (721664 bytes allocated)
. The main benefit was changing from a O(n^2) double for loop to the O(n) cumsum
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With