I am having trouble describing my problem, so I'll jump right into it. Here is some test data:
import pandas as pd
df = pd.DataFrame(data={"family":["Smith","Miller","Simpson","Miller","Simpson","Smith","Miller","Simpson","Miller"],
"first_name":["Anna","Bart","Lisa","Ida","Paul","Bridget","Harry","Dustin","George"],
"shirt_color":["green","yellow","red","yellow","green","red","yellow","red","red"]})
Now I'd like to create a new column in my original dataframe that contains the share of shirt_color per family, so every row with e.g. family Miller and shirt_color yellow has the same value 0.75, etc..
I have tried several approaches, but did not succeed with any of them.
df = df.groupby("family").apply(lambda x: x.groupby("shirt_color").apply(lambda x: x.size()/familysize))
This seemed promising, but as you can see, I could not access the number of family members in the last lambda function any more. I also tried to create a groupby object of only family and iterate over the dataframes, grouping all dataframes by color individually, but somehow I could not manage to put the dataframes back to one in the end.
This doesn't seem to be a very exotic thing to do with a dataframe, so I'm sure there is an easy way to do this, but I'm out of ideas.
Thanks a lot for your help in advance!
In my opinion, you should avoid apply
as this leads to an inefficient Python-level loop. Here's an alternative solution using GroupBy
+ transform
:
f = df.groupby('family')['first_name'].transform('size')
g = df.groupby(['family', 'shirt_color'])['first_name'].transform('size')
df['ratio'] = g / f
print(df)
family first_name shirt_color ratio
0 Smith Anna green 0.500000
1 Miller Bart yellow 0.750000
2 Simpson Lisa red 0.666667
3 Miller Ida yellow 0.750000
4 Simpson Paul green 0.333333
5 Smith Bridget red 0.500000
6 Miller Harry yellow 0.750000
7 Simpson Dustin red 0.666667
8 Miller George red 0.250000
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With