In the following snippet, data is a pandas.DataFrame and indices is a set of columns of the data. After grouping the data with groupby, I am interested in the ids of the groups, but only those with a size greater than a threshold (say: 3).
group_ids=data.groupby(list(data.columns[list(indices)])).grouper.group_info[0]
Now, knowing the id of each group, how can I find which groups have a size greater than or equal to 3? I only want the ids of groups with a certain size.
#TODO: filter out ids from group_ids which correspond to groups with sizes < 3
One way is to use the size method of the groupby:
g = data.groupby(...)
size = g.size()
size[size > 3]
For example, here there is only one group of size > 1:
In [11]: df = pd.DataFrame([[1, 2], [3, 4], [1,6]], columns=['A', 'B'])
In [12]: df
Out[12]:
   A  B
0  1  2
1  3  4
2  1  6
In [13]: g = df.groupby('A')
In [14]: size = g.size()
In [15]: size[size > 1]
Out[15]:
A
1    2
dtype: int64
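To tie this back to the integer group ids in the question, here is a minimal sketch that labels each row with its group's id and size, then keeps only the ids of large groups. It assumes that ngroup() numbering plays the same role as the id array in the question; the threshold value and the variable names are illustrative, not from the original post.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [1, 6]], columns=["A", "B"])
threshold = 2  # hypothetical minimum group size

g = df.groupby("A")

# Integer id of the group each row belongs to (one entry per row).
row_group_ids = g.ngroup()

# Size of the group each row belongs to, aligned with the rows of df.
row_group_sizes = g["B"].transform("size")

# Keep the ids of groups that meet the threshold, without duplicates.
large_ids = row_group_ids[row_group_sizes >= threshold].unique()
print(large_ids)
```

Because both Series share the index of df, the boolean comparison can be used directly as a mask.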
If you were interested in just restricting the DataFrame to the rows that belong to large groups, you could use the filter method:
In [21]: g.filter(lambda x: len(x) > 1)
Out[21]:
   A  B
0  1  2
2  1  6
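As an alternative sketch, the same rows can be selected with a boolean mask built from transform("size"), which avoids calling a Python lambda once per group. This is an illustration of the idea on the example frame above, not a claim about which approach is faster on your data.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [1, 6]], columns=["A", "B"])

# True for every row whose group (by column A) has more than one member.
mask = df.groupby("A")["B"].transform("size") > 1

# Select the rows in large groups; the original index is preserved.
large = df[mask]
print(large)
```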