I have the following dataframe:
import pandas as pd
df = pd.DataFrame.from_dict({'case': ['foo', 'foo', 'foo', 'foo', 'bar'],
                             'cluster': [1, 1, 1, 2, 1],
                             'conf': [1, 2, 3, 1, 1]})
df
Out[3]:
  case  cluster  conf
0  foo        1     1
1  foo        1     2
2  foo        1     3
3  foo        2     1
4  bar        1     1
If I group by 'case' and 'cluster', I can remove the elements belonging to groups with only 1 element:
df.groupby(['case', 'cluster']).filter(lambda x: len(x) > 1)
Out[4]:
  case  cluster  conf
0  foo        1     1
1  foo        1     2
2  foo        1     3
I can also compute the mean number of elements per group for each 'case' value:
df.groupby(['case', 'cluster']).size().groupby(level='case').mean()
Out[5]:
case
bar    1.0
foo    2.0
dtype: float64
But how can I filter out the elements belonging to groups with fewer elements than the corresponding mean value? The output I am expecting is:
  case  cluster  conf
0  foo        1     1
1  foo        1     2
2  foo        1     3
4  bar        1     1
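You can use the name attribute of each group to look up the mean group size for its 'case' value while using filter. Because the grouping uses two keys, x.name is a (case, cluster) tuple, so x.name[0] is the 'case' value: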
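# mean group size per 'case' (Series.mean(level=...) was removed in pandas 2.0)
grp_mean = df.groupby(['case', 'cluster']).size().groupby(level='case').mean()
# keep only the groups that are at least as large as the mean for their 'case'
df = df.groupby(['case', 'cluster']).filter(lambda x: len(x) >= grp_mean[x.name[0]])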
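To make the x.name[0] lookup less opaque, here is a minimal sketch of what the group names look like for this data. It iterates over the groupby instead of calling filter, on the understanding that filter attaches the same key to each sub-frame as x.name. The variable name `names` is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'case': ['foo', 'foo', 'foo', 'foo', 'bar'],
                   'cluster': [1, 1, 1, 2, 1],
                   'conf': [1, 2, 3, 1, 1]})

# Iterating a GroupBy yields (name, sub-frame) pairs.
# With two grouping keys, each name is a (case, cluster) tuple.
names = [name for name, _ in df.groupby(['case', 'cluster'])]

# The first element of each tuple is the 'case' value used for the lookup.
cases = [name[0] for name in names]
print(cases)
```

The groups come back sorted by key, so 'bar' appears before 'foo' here.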
As pointed out by @MaxU, the filter solution can be sped up slightly by creating the groupby object once and reusing it:
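# build the groupby once and reuse it for both steps
g = df.groupby(['case', 'cluster'])
# mean group size per 'case' (Series.mean(level=...) was removed in pandas 2.0)
grp_mean = g.size().groupby(level='case').mean()
df = g.filter(lambda x: len(x) >= grp_mean[x.name[0]])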
The resulting output:
  case  cluster  conf
0  foo        1     1
1  foo        1     2
2  foo        1     3
4  bar        1     1
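As an alternative to calling a Python lambda once per group, the same comparison can be expressed with a boolean mask. This is a sketch of a different technique (transform plus map), not the approach given in the answer above; the names `row_size`, `row_mean` and `result` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'case': ['foo', 'foo', 'foo', 'foo', 'bar'],
                   'cluster': [1, 1, 1, 2, 1],
                   'conf': [1, 2, 3, 1, 1]})

g = df.groupby(['case', 'cluster'])

# Mean (case, cluster) group size per 'case'; each group counts once.
mean_by_case = g.size().groupby(level='case').mean()

# Size of the group each row belongs to, aligned with df's index.
row_size = g['conf'].transform('size')

# Mean group size for each row's 'case', also aligned with df's index.
row_mean = df['case'].map(mean_by_case)

# Keep rows whose group is at least as large as the mean for its 'case'.
result = df[row_size >= row_mean]
print(result)
```

Note that the mean is taken over groups, not over rows: averaging `row_size` per 'case' directly would weight larger groups more heavily and give a different threshold.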
# hard-coded thresholds on 'conf': a for 'foo', b for 'bar'
a = 2
b = 1
pd.concat([df[(df.conf >= a) & (df.case == 'foo')],
           df[(df.conf >= b) & (df.case == 'bar')]])
  case  cluster  conf
1  foo        1     2
2  foo        1     3
4  bar        1     1