I have a dataframe which contains categorized data about the educational backgrounds of the directors of several companies. Currently, each company (recorded by its ticker) has multiple entries, one per director, and the df looks something like this:
Ticker Education
ABC 1
ABC 1
ABC 5
ABC 7
ABC 5
DEF 3
DEF 4
DEF 4
DEF 4
DEF 6
I want to use Blau's Index (the same as the Gini-Simpson Index) to create a new dataframe with only one entry per company, as follows:
Ticker Education Diversity
ABC 0.64
DEF 0.56
The formula used is $1 - \sum_i p_i^2$, where $p_i$ is the proportion of directors in education category $i$; e.g. for company ABC, $p_1 = 2/5$.
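Worked through by hand for the sample data above, the expected values come out as follows. ABC has categories 1, 5 and 7 with counts 2, 2 and 1; DEF has categories 3, 4 and 6 with counts 1, 3 and 1:

```latex
\begin{aligned}
\text{ABC:}\quad 1 - \left[\left(\tfrac{2}{5}\right)^2 + \left(\tfrac{2}{5}\right)^2 + \left(\tfrac{1}{5}\right)^2\right] &= 1 - \tfrac{9}{25} = 0.64 \\
\text{DEF:}\quad 1 - \left[\left(\tfrac{1}{5}\right)^2 + \left(\tfrac{3}{5}\right)^2 + \left(\tfrac{1}{5}\right)^2\right] &= 1 - \tfrac{11}{25} = 0.56
\end{aligned}
```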
Can anyone help me implement this in Python (3.7)? Any help would be greatly appreciated!
You could try implementing your own function with def, then use groupby.apply. Finally, use Series.reset_index to get back to DataFrame format:
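def blaus_index(arr):
    # 1 minus the sum of squared category proportions
    return 1 - sum((arr.value_counts() / len(arr)) ** 2)

# one row per ticker, with the index stored in a named column
df.groupby('Ticker')['Education'].apply(blaus_index).reset_index(name='Education Diversity')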
Ticker Education Diversity
0 ABC 0.64
1 DEF 0.56
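For anyone who wants to run this end to end, here is a minimal self-contained sketch. It rebuilds the sample data from the question by hand (the `df` construction is my own, not from the original post) and uses `value_counts(normalize=True)` as an alternative way to get the proportions. One assumption to be aware of: `value_counts` drops missing values by default, so this variant divides by the number of non-missing entries, whereas dividing by `len(arr)` counts missing entries in the denominator. With no missing values, as here, the two agree.

```python
import pandas as pd

# Sample data reconstructed from the question (illustrative only)
df = pd.DataFrame({
    "Ticker": ["ABC"] * 5 + ["DEF"] * 5,
    "Education": [1, 1, 5, 7, 5, 3, 4, 4, 4, 6],
})


def blaus_index(categories: pd.Series) -> float:
    """Return 1 minus the sum of squared category proportions."""
    # normalize=True yields proportions instead of raw counts
    proportions = categories.value_counts(normalize=True)
    return 1 - (proportions ** 2).sum()


result = (
    df.groupby("Ticker")["Education"]
      .apply(blaus_index)
      .reset_index(name="Education Diversity")
)

# Round only for display; the stored values are ordinary floats
print(result.round(2))
```

The printed frame should show roughly 0.64 for ABC and 0.56 for DEF, matching the expected output in the question.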