
Remove groups with size smaller than mean group size in pandas

Tags: python, pandas

I have the following dataframe:

import pandas as pd

df = pd.DataFrame.from_dict({'case': ['foo', 'foo', 'foo', 'foo', 'bar'],
                             'cluster': [1, 1, 1, 2, 1],
                             'conf': [1, 2, 3, 1, 1]})

df
Out[3]: 
  case  cluster  conf
0  foo        1     1
1  foo        1     2
2  foo        1     3
3  foo        2     1
4  bar        1     1

If I group by 'case' and 'cluster', I can remove the elements belonging to groups with only 1 element:

df.groupby(['case', 'cluster']).filter(lambda x: len(x) > 1)
Out[4]: 
  case  cluster  conf
0  foo        1     1
1  foo        1     2
2  foo        1     3 

I can also compute the mean number of elements per group for each 'case' value:

df.groupby(['case', 'cluster']).size().groupby(level='case').mean()
Out[5]: 
case
bar    1.0
foo    2.0
dtype: float64

But how can I filter out the elements belonging to groups that have fewer elements than the mean group size of their 'case'? The output I am expecting is:

  case  cluster  conf
0  foo        1     1
1  foo        1     2
2  foo        1     3
4  bar        1     1
saltimbanqui asked May 10 '17 17:05


2 Answers

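Inside filter, each group carries a name attribute holding its key, here the tuple (case, cluster). You can use its first element to look up the mean group size for that 'case' in a precomputed Series:

# Mean number of rows per (case, cluster) group, indexed by case
grp_mean = df.groupby(['case', 'cluster']).size().groupby(level='case').mean()
# Keep only the groups at least as large as the mean for their case
df = df.groupby(['case', 'cluster']).filter(lambda x: len(x) >= grp_mean[x.name[0]])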

As pointed out by @MaxU, this can be sped up slightly by creating the groupby object once and reusing it:

g = df.groupby(['case', 'cluster'])
grp_mean = g.size().groupby(level='case').mean()
df = g.filter(lambda x: len(x) >= grp_mean[x.name[0]])

The resulting output:

  case  cluster  conf
0  foo        1     1
1  foo        1     2
2  foo        1     3
4  bar        1     1
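
As an alternative to the filter-with-lambda approach above, the same comparison can be sketched without a Python-level callback by aligning both quantities to the rows of the frame. This is a minimal sketch, not part of the original answer: it assumes a recent pandas version, and the names row_size, grp_mean and result are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'case': ['foo', 'foo', 'foo', 'foo', 'bar'],
                   'cluster': [1, 1, 1, 2, 1],
                   'conf': [1, 2, 3, 1, 1]})

# Size of the (case, cluster) group each row belongs to, aligned to df's index.
row_size = df.groupby(['case', 'cluster'])['conf'].transform('size')

# Mean group size per case, computed over groups (not over rows).
grp_mean = df.groupby(['case', 'cluster']).size().groupby(level='case').mean()

# Keep rows whose group is at least as large as the mean for their case.
result = df[row_size >= df['case'].map(grp_mean)]
print(result)
```

The mean is taken over the per-group sizes first and only then mapped back to the rows, so larger groups do not weigh more heavily in the average than smaller ones.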
root answered Nov 14 '22 22:11


a, b = 2, 1  # hard-coded minimum conf value for 'foo' and 'bar'
pd.concat([df[(df.conf >= a) & (df.case == 'foo')],
           df[(df.conf >= b) & (df.case == 'bar')]])

  case  cluster  conf
1  foo        1     2
2  foo        1     3
4  bar        1     1
galaxyan answered Nov 14 '22 22:11