Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Create share variable after combined groupby in pandas dataframe

I am having trouble describing my problem, so I'll jump right into it. Here is some test data:

import pandas as pd
df = pd.DataFrame(data={"family":["Smith","Miller","Simpson","Miller","Simpson","Smith","Miller","Simpson","Miller"],
                    "first_name":["Anna","Bart","Lisa","Ida","Paul","Bridget","Harry","Dustin","George"],
                    "shirt_color":["green","yellow","red","yellow","green","red","yellow","red","red"]})

Now I'd like to create a new column in my original dataframe that contains the share of shirt_color per family, so every row with e.g. family Miller and shirt_color yellow has the same value 0.75, etc..

I have tried several approaches, but did not succeed with any of them.

df = df.groupby("family").apply(lambda x: x.groupby("shirt_color").apply(lambda x: x.size()/familysize))

This seemed promising, but as you can see, I could not access the number of family members in the last lambda function any more. I also tried to create a groupby object of only family and iterate over the dataframes, grouping all dataframes by color individually, but somehow I could not manage to put the dataframes back to one in the end.

This doesn't seem to be a very exotic thing to do with a dataframe, so I'm sure there is an easy way to do this, but I'm out of ideas.

Thanks a lot for your help in advance!

like image 642
meansquare Avatar asked Dec 24 '22 03:12

meansquare


1 Answers

In my opinion, you should avoid apply as this leads to an inefficient Python-level loop. Here's an alternative solution using GroupBy + transform:

f = df.groupby('family')['first_name'].transform('size')
g = df.groupby(['family', 'shirt_color'])['first_name'].transform('size')

df['ratio'] = g / f

print(df)

    family first_name shirt_color     ratio
0    Smith       Anna       green  0.500000
1   Miller       Bart      yellow  0.750000
2  Simpson       Lisa         red  0.666667
3   Miller        Ida      yellow  0.750000
4  Simpson       Paul       green  0.333333
5    Smith    Bridget         red  0.500000
6   Miller      Harry      yellow  0.750000
7  Simpson     Dustin         red  0.666667
8   Miller     George         red  0.250000
like image 167
jpp Avatar answered Dec 25 '22 16:12

jpp