In pandas I want to do:

df.groupby('A').filter(lambda x: x.name > 0)

that is:
- group by column A
- then drop the groups whose name (the group key) is non-positive.

However, this discards the grouping, because GroupBy.filter returns a DataFrame. I want to do it in this order because it should be less computationally demanding: filter followed by groupby would walk the DataFrame twice (first filtering, then grouping), wouldn't it? Also, copying the groups out of the grouping (into a dict or similar) would lose the ability to seamlessly get back to a DataFrame (the way .filter hands you a DataFrame directly).
Thanks
Example:

   A  B
1 -1  1
2 -1  2
3  0  2
4  1  1
5  1  2

df.groupby('A') gives:

GroupBy object
-1 : [1, 2]
 0 : [3]
 1 : [4, 5]

and I would like GroupBy.filter(lambda x: x.name >= 0) to give back a grouping:

GroupBy object
0 : [3]
1 : [4, 5]
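For concreteness, here is a runnable sketch of what actually happens today with the example frame: the predicate is evaluated per group (x.name is the group key), but filter hands back a plain DataFrame, so a second groupby is needed to keep working per group.

```python
import pandas as pd

# The example frame from the question.
df = pd.DataFrame({'A': [-1, -1, 0, 1, 1],
                   'B': [1, 2, 2, 1, 2]},
                  index=[1, 2, 3, 4, 5])

# filter() keeps only the groups whose key is >= 0,
# but returns a plain DataFrame, so the grouping is lost.
kept = df.groupby('A').filter(lambda x: x.name >= 0)
print(type(kept))  # <class 'pandas.core.frame.DataFrame'>
print(kept)

# To continue working per group you have to group again.
print(kept.groupby('A').size())
```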
I think the previous answers propose workarounds, which may be useful in your case, but they don't answer the question.
You created groups, and you want to throw out or keep some groups based on group statistics, and THEN compute the group statistics you actually care about on the surviving groups. This should be possible, and would be useful in many cases; however, as far as I know it is not currently possible as a chained command, only by using two identical groupbys in succession.
Let's make a case for it: groupby can reveal features that are not filterable at the item level (so filtering beforehand is not an option), for example a group sum. The annoyance with filter is that it returns a DataFrame rather than keeping the grouping and letting you perform further computations on the groups.
Here is an example:
Let's say you want to group by 'C' and filter on the sums of 'A' within the groups (< 700), but in the filtered groups you actually care about the std of the groups. If filter were just a filter on groups, this would work:
df.groupby(['C']).filter(lambda x: x['A'].sum() < 700, combine=False).std()
This doesn't work (note the nonexistent combine=False option on filter); what does work is this:
df.groupby(['C']).filter(lambda x: x['A'].sum() < 700).groupby(['C']).std()
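A runnable sketch of that double-groupby pattern, with made-up data (column names as above; the 'x' group sums to 300 and survives the < 700 cut, the 'y' group sums to 1100 and is dropped):

```python
import pandas as pd

# Made-up data: two groups in 'C'; their sums of 'A' are 300 and 1100.
df = pd.DataFrame({'C': ['x', 'x', 'y', 'y'],
                   'A': [100, 200, 500, 600]})

# filter() combines back into a DataFrame, so we must group by 'C'
# again to compute the statistic we actually care about (std) per group.
result = df.groupby(['C']).filter(lambda g: g['A'].sum() < 700).groupby(['C']).std()
print(result)  # only group 'x' survives
```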
What filter actually does is filter & combine, following pandas' split-apply-combine logic.
Using groupby doesn't actually aggregate the values in any way. It just creates the groupings, so the filter is essentially filtering on the original dataframe. I don't think you're saving time or computation by grouping first, unless name is a value obtained by applying a function to the groups.
Therefore I'd recommend something like (in your example the group key is simply the column 'A', so you can mask on it directly)
df[df['A'] > 0].groupby('A')  # now apply some transformation to the groups
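A runnable version of that idea, using the question's frame, where the group key is the column 'A' itself, so the pre-filter is a plain boolean row mask:

```python
import pandas as pd

df = pd.DataFrame({'A': [-1, -1, 0, 1, 1],
                   'B': [1, 2, 2, 1, 2]},
                  index=[1, 2, 3, 4, 5])

# Filter rows first (the group key is just the column 'A' here),
# then group once; each step walks the frame a single time.
grouped = df[df['A'] > 0].groupby('A')
print(grouped.sum())  # only the A == 1 group remains
```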