
How can I use Python to aggregate data from multiple directors in various companies into one figure per company using Blau's Index?

I have a dataframe which contains categorized data about the educational backgrounds of the directors of several companies. Currently, each company (recorded by its ticker) has multiple entries, one per director, and the df looks something like this:

Ticker  Education
ABC     1
ABC     1
ABC     5
ABC     7
ABC     5
DEF     3
DEF     4
DEF     4
DEF     4
DEF     6

I want to use Blau's Index (the same formula as the Gini-Simpson Index) to create a new dataframe with only one entry per company, as follows:

Ticker  Education Diversity
ABC     0.64
DEF     0.56

The formula used is 1 - ∑ p_i^2, where p_i is the proportion of directors in education category i; e.g. for company ABC, p_1 = 2/5.

Can anyone help me implement this in Python (3.7)? Any help would be greatly appreciated!

amiskov asked Oct 15 '25 04:10



1 Answer

You could try implementing your own function, then apply it per company with groupby.apply. Finally, use Series.reset_index to get back to DataFrame format:

def blaus_index(arr):
    # 1 minus the sum of squared category proportions within one group
    return 1 - sum((arr.value_counts() / len(arr)) ** 2)

df.groupby('Ticker')['Education'].apply(blaus_index).reset_index(name='Education Diversity')

  Ticker  Education Diversity
0    ABC                 0.64
1    DEF                 0.56
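
For anyone who wants to run this end to end, here is a minimal self-contained sketch that rebuilds the sample data from the question and applies the same idea. It swaps in value_counts(normalize=True) for the manual division, which should be equivalent for this data; the variable names (df, result, proportions) are just illustrative, and it assumes the Education column has no missing values.

```python
import pandas as pd

# Sample data copied from the question: one row per director.
df = pd.DataFrame({
    "Ticker": ["ABC"] * 5 + ["DEF"] * 5,
    "Education": [1, 1, 5, 7, 5, 3, 4, 4, 4, 6],
})


def blaus_index(categories: pd.Series) -> float:
    """Return 1 - sum(p_i^2) over the categories in one group."""
    # normalize=True returns proportions instead of raw counts.
    # Assumption: no NaN values; value_counts drops NaN by default,
    # so proportions would then be relative to the non-missing rows only.
    proportions = categories.value_counts(normalize=True)
    return 1 - (proportions ** 2).sum()


# One row per ticker, with the index turned back into a regular column.
result = (
    df.groupby("Ticker")["Education"]
      .apply(blaus_index)
      .reset_index(name="Education Diversity")
)

print(result.round(2))
```

If some directors have a missing education category, it is worth deciding up front whether they should be excluded or counted as their own category, since that changes the proportions.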
Chris Adams answered Oct 17 '25 18:10
