 

Python - Pandas Data manipulation to calculate Gini Coefficient

I have a dataset with the following shape:

tconst  GreaterEuropean British WestEuropean    Italian French  Jewish  Germanic    Nordic  Asian   GreaterEastAsian    Japanese    Hispanic    GreaterAfrican  Africans    EastAsian   Muslim  IndianSubContinent  total_ethnicities
0   tt0000001   3   1   2   0   1   0   0   1   0   0   0   0   0   0   0   0   0   8
1   tt0000002   2   0   2   0   2   0   0   0   0   0   0   0   0   0   0   0   0   6
2   tt0000003   4   0   3   0   3   1   0   0   0   0   0   0   0   0   0   0   0   11
3   tt0000004   2   0   2   0   2   0   0   0   0   0   0   0   0   0   0   0   0   6
4   tt0000005   3   2   1   0   0   0   1   0   0   0   0   0   0   0   0   0   0   7

This is IMDb data. After processing, I created these columns, each of which holds the number of actors of that ethnicity in a movie (tconst).

I want to create another column, df["diversity"], holding a diversity score (the "Gini index").

For example, say a movie has 10 actors: 3 Asian, 3 British, 3 African American and 1 French. Dividing each count by the total gives 3/10, 3/10, 3/10 and 1/10. The score is 1 minus the sum of the squares of these fractions: 1 - [(3/10)^2 + (3/10)^2 + (3/10)^2 + (1/10)^2]. The score for each movie should go into the diversity column.

I have tried simple pandas manipulation, but I am not getting there.

EDIT:

For the first row, total_ethnicities is 8:

3 GreaterEuropean
1 British
2 WestEuropean
1 French
1 Nordic

So the score will be:

1 - [(3/8)^2 + (1/8)^2 + (2/8)^2 + (1/8)^2 + (1/8)^2] = 0.75
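
The arithmetic above can be checked with a minimal plain-Python sketch. The function name diversity_score is hypothetical and only used for illustration, and returning NaN for a movie with no actors is an assumption, since the question does not say what should happen in that case.

```python
def diversity_score(counts):
    """Return 1 minus the sum of squared shares for a list of counts."""
    total = sum(counts)
    if total == 0:
        # Assumption: with no actors the score is undefined, so return NaN
        return float("nan")
    return 1 - sum((c / total) ** 2 for c in counts)


# Non-zero counts from the first row (total 8)
first_row = diversity_score([3, 1, 2, 1, 1])
# The 10-actor example: 3 Asian, 3 British, 3 African American, 1 French
ten_actors = diversity_score([3, 3, 3, 1])

print(round(first_row, 6), round(ten_actors, 6))
```

Zero counts contribute nothing to the sum, so passing the full row including the zeros gives the same score.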

Shivam asked Oct 18 '25 14:10


2 Answers

You can make use of NumPy vectorization here:

import numpy as np

# Assumes tconst is already the index; if it is still a column,
# run df = df.set_index('tconst') first
# Select the values other than total_ethnicities
one = df.drop(columns=['total_ethnicities']).values
# Select the values of total_ethnicities as a column vector so they broadcast per row
two = df['total_ethnicities'].values[:, None]
# Divide the values of one by two, square them, sum over each row, then subtract from 1
df['diversity'] = 1 - np.sum((one / two) ** 2, axis=1)
df['diversity']

tconst
tt0000001    0.750000
tt0000002    0.666667
tt0000003    0.710744
tt0000004    0.666667
tt0000005    0.693878
Name: diversity, dtype: float64
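
For anyone who wants to reproduce this, here is a self-contained sketch that rebuilds the five sample rows from the question and applies the same broadcasting idea. How the frame is constructed (DataFrame.from_dict with tconst as the index) is an assumption made for the example; the original data would normally be loaded from file.

```python
import numpy as np
import pandas as pd

columns = [
    "GreaterEuropean", "British", "WestEuropean", "Italian", "French",
    "Jewish", "Germanic", "Nordic", "Asian", "GreaterEastAsian",
    "Japanese", "Hispanic", "GreaterAfrican", "Africans", "EastAsian",
    "Muslim", "IndianSubContinent", "total_ethnicities",
]

# Sample rows from the question: 17 counts followed by the total
rows = {
    "tt0000001": [3, 1, 2, 0, 1, 0, 0, 1] + [0] * 9 + [8],
    "tt0000002": [2, 0, 2, 0, 2] + [0] * 12 + [6],
    "tt0000003": [4, 0, 3, 0, 3, 1] + [0] * 11 + [11],
    "tt0000004": [2, 0, 2, 0, 2] + [0] * 12 + [6],
    "tt0000005": [3, 2, 1, 0, 0, 0, 1] + [0] * 10 + [7],
}
df = pd.DataFrame.from_dict(rows, orient="index", columns=columns)
df.index.name = "tconst"

counts = df.drop(columns=["total_ethnicities"]).to_numpy()   # shape (5, 17)
totals = df["total_ethnicities"].to_numpy()[:, None]         # shape (5, 1)

# (5, 17) / (5, 1) broadcasts each row's total across that row's counts
df["diversity"] = 1 - np.sum((counts / totals) ** 2, axis=1)
print(df["diversity"])
```

The [:, None] indexing is what turns the totals into a column vector; without it NumPy would try to align the five totals with the seventeen columns and raise a shape error.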
Bharath answered Oct 21 '25 02:10


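# Move tconst into the index so only numeric columns remain
df2 = df.set_index('tconst')
# pop removes total_ethnicities from df2 and returns it as a Series
total = df2.pop('total_ethnicities')
# Sum of squared counts over the squared total equals the sum of squared shares
result = 1 - (df2 ** 2).div(total ** 2, axis=0).sum(axis=1)
result.name = 'gini'
result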
tconst
tt0000001    0.750000
tt0000002    0.666667
tt0000003    0.710744
tt0000004    0.666667
tt0000005    0.693878
Name: gini, dtype: float64

Apart from that, I always try to keep my raw data separate from my parsed data, so I would keep the total_ethnicities column in a separate Series and only combine them when needed for reporting the results.

If you really want this result as an extra column in df, you can do this by:

df = df.join(result, on='tconst')
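
As a rough end-to-end sketch of this approach, the snippet below uses a trimmed copy of the first two sample rows. Leaving out the all-zero columns is a shortcut for the example only and does not change the score, because zero counts add nothing to the sum.

```python
import pandas as pd

# Trimmed copy of the first two sample rows; all-zero columns are left out
df = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002"],
    "GreaterEuropean": [3, 2],
    "British": [1, 0],
    "WestEuropean": [2, 2],
    "French": [1, 2],
    "Nordic": [1, 0],
    "total_ethnicities": [8, 6],
})

df2 = df.set_index("tconst")            # new frame, df itself is unchanged
total = df2.pop("total_ethnicities")    # removed from df2 only
gini = 1 - (df2 ** 2).div(total ** 2, axis=0).sum(axis=1)
gini.name = "gini"                      # join needs a named Series

# on="tconst" matches the tconst column of df against the index of gini
df = df.join(gini, on="tconst")
print(df[["tconst", "total_ethnicities", "gini"]])
```

Because set_index returns a new frame, df still has its total_ethnicities column after the pop, so the raw data stays intact until the final join.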
Maarten Fabré answered Oct 21 '25 02:10


