To illustrate my problem, I have an example dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
                   'key2': ['a', 'b', 'b', 'c', 'a', 'a', 'a', 'b', 'b', 'c'],
                   'key3': [10, 5, 15, 10, 5, 10, 20, 10, 20, 5],
                   'zdata': [2, 4, 2, 4, 3, 5, 6, 5, 5, 6]})
which, when grouped and aggregated with
df1 = df.groupby(['key1', 'key2', 'key3'])
df1 = df1.agg({'zdata' : [np.mean]})
produces
                zdata
                 mean
key1 key2 key3
0    a    10        2
     b    5         4  * (new group1)
          15        2  * (new group1)
     c    10        4
1    a    5         3  * (new group2)
          10        5  * (new group2)
          20        6  * (new group2)
     b    10        5
          20        5
     c    5         6
I now need to find the mean of zdata and key3, grouping by key1 and key2, but only for groups where key3 contains a 5 and at least one other value (i.e. the rows marked with an * in the table above).
So the desired result would be:
           key3  zdata
           mean   mean
key1 key2
0    b     10     3
1    a     11.6   4.6
I am certain a lambda function will get me there, as in the example "Python Pandas Conditional Sums", but this problem is grouped on multiple columns.
Any help would be much appreciated.
I think it is better to use groupby with mean instead of agg, because the result is then a DataFrame with no MultiIndex in the columns. After that you can use filter with any and len to keep only the groups you want, and finally groupby with mean again:
# instead of agg use:
df1 = df.groupby(['key1', 'key2', 'key3'], as_index=False)['zdata'].mean()
print(df1)
   key1 key2  key3  zdata
0     0    a    10      2
1     0    b     5      4
2     0    b    15      2
3     0    c    10      4
4     1    a     5      3
5     1    a    10      5
6     1    a    20      6
7     1    b    10      5
8     1    b    20      5
9     1    c     5      6
# keep only (key1, key2) groups that contain key3 == 5 and have more than one row
df2 = df1.groupby(['key1', 'key2']).filter(lambda x: (x.key3 == 5).any() & (len(x) > 1))
print(df2)
   key1 key2  key3  zdata
1     0    b     5      4
2     0    b    15      2
4     1    a     5      3
5     1    a    10      5
6     1    a    20      6
print(df2.groupby(['key1', 'key2'], as_index=False).mean())
   key1 key2       key3     zdata
0     0    b  10.000000  3.000000
1     1    a  11.666667  4.666667
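For convenience, the three steps above can be combined into one self-contained sketch. The function name conditional_group_mean and its value parameter are my own additions for illustration and are not part of pandas; the snippet assumes a reasonably recent pandas and Python 3.

```python
import pandas as pd

df = pd.DataFrame({
    'key1': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    'key2': ['a', 'b', 'b', 'c', 'a', 'a', 'a', 'b', 'b', 'c'],
    'key3': [10, 5, 15, 10, 5, 10, 20, 10, 20, 5],
    'zdata': [2, 4, 2, 4, 3, 5, 6, 5, 5, 6],
})


def conditional_group_mean(frame, value=5):
    """Hypothetical helper: mean of key3 and zdata per (key1, key2),
    restricted to groups whose key3 values include `value` plus at
    least one other key3 value."""
    # Step 1: mean of zdata per (key1, key2, key3), keys kept as columns.
    per_key3 = frame.groupby(['key1', 'key2', 'key3'],
                             as_index=False)['zdata'].mean()

    # Step 2: keep a (key1, key2) group only if it contains `value`
    # in key3 and has more than one distinct key3 row.
    kept = per_key3.groupby(['key1', 'key2']).filter(
        lambda g: (g['key3'] == value).any() and len(g) > 1
    )

    # Step 3: average key3 and zdata within each remaining group.
    return kept.groupby(['key1', 'key2'],
                        as_index=False)[['key3', 'zdata']].mean()


result = conditional_group_mean(df)
print(result)
```

Selecting the numeric columns explicitly in the last step is a defensive choice on my part, so the call does not depend on how mean treats any non-numeric columns that might be added later.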