Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

calculating Gini coefficient in Python/numpy

Tags:

i'm calculating Gini coefficient (similar to: Python - Gini coefficient calculation using Numpy) but i get an odd result. for a uniform distribution sampled from np.random.rand(), the Gini coefficient is 0.3 but I would have expected it to be close to 0 (perfect equality). what is going wrong here?

def G(v):     bins = np.linspace(0., 100., 11)     total = float(np.sum(v))     yvals = []     for b in bins:         bin_vals = v[v <= np.percentile(v, b)]         bin_fraction = (np.sum(bin_vals) / total) * 100.0         yvals.append(bin_fraction)     # perfect equality area     pe_area = np.trapz(bins, x=bins)     # lorenz area     lorenz_area = np.trapz(yvals, x=bins)     gini_val = (pe_area - lorenz_area) / float(pe_area)     return bins, yvals, gini_val  v = np.random.rand(500) bins, result, gini_val = G(v) plt.figure() plt.subplot(2, 1, 1) plt.plot(bins, result, label="observed") plt.plot(bins, bins, '--', label="perfect eq.") plt.xlabel("fraction of population") plt.ylabel("fraction of wealth") plt.title("GINI: %.4f" %(gini_val)) plt.legend() plt.subplot(2, 1, 2) plt.hist(v, bins=20) 

for the given set of numbers, the above code calculates the fraction of the total distribution's values that are in each percentile bin.

the result:

enter image description here

uniform distributions should be near "perfect equality" so the lorenz curve bending is off.

like image 544
mvd Avatar asked Sep 15 '16 13:09

mvd


People also ask

How do you calculate Gini coefficient in Python?

Gini in Python Sorting the datapoints in ascending order and assigning a positional index i yields G=∑i(2i−n−1)xin∑ixi, G = ∑ i ( 2 i − n − 1 ) x i n ∑ i x i , which is even speedier to compute.

How do you calculate Gini coefficient?

The Gini coefficient can be calculated using the formula: Gini Coefficient = A / (A + B), where A is the area above the Lorenz Curve and B is the area below the Lorenz Curve.

What is Gini Python?

The Gini impurity measure is one of the methods used in decision tree algorithms to decide the optimal split from a root node, and subsequent splits.


1 Answers

This is to be expected. A random sample from a uniform distribution does not result in uniform values (i.e. values that are all relatively close to each other). With a little calculus, it can be shown that the expected value (in the statistical sense) of the Gini coefficient of a sample from the uniform distribution on [0, 1] is 1/3, so getting values around 1/3 for a given sample is reasonable.

You'll get a lower Gini coefficient with a sample such as v = 10 + np.random.rand(500). Those values are all close to 10.5; the relative variation is lower than the sample v = np.random.rand(500). In fact, the expected value of the Gini coefficient for the sample base + np.random.rand(n) is 1/(6*base + 3).

Here's a simple implementation of the Gini coefficient. It uses the fact that the Gini coefficient is half the relative mean absolute difference.

def gini(x):     # (Warning: This is a concise implementation, but it is O(n**2)     # in time and memory, where n = len(x).  *Don't* pass in huge     # samples!)      # Mean absolute difference     mad = np.abs(np.subtract.outer(x, x)).mean()     # Relative mean absolute difference     rmad = mad/np.mean(x)     # Gini coefficient     g = 0.5 * rmad     return g 

Here's the Gini coefficient for several samples of the form v = base + np.random.rand(500):

In [80]: v = np.random.rand(500)  In [81]: gini(v) Out[81]: 0.32760618249832563  In [82]: v = 1 + np.random.rand(500)  In [83]: gini(v) Out[83]: 0.11121487509454202  In [84]: v = 10 + np.random.rand(500)  In [85]: gini(v) Out[85]: 0.01567937753659053  In [86]: v = 100 + np.random.rand(500)  In [87]: gini(v) Out[87]: 0.0016594595244509495 
like image 60
Warren Weckesser Avatar answered Oct 10 '22 04:10

Warren Weckesser