Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Filter groups after GroupBy in pandas while keeping the groups

in pandas I want to do: df.groupby('A').filter(lambda x: x.name > 0) - group by column A and then filter groups that have the value of the name non positive. However this canceles the grouping as GroupBy.filter returns DataFrame and thus losing the groupings. I want to do it in this order as it should be less computationaly demanding because filter followed by groupby would walk the DataFrame twice no (first filtering and then grouping)? Also cloning the groups from the grouping (to a dict or something) would lose me the functionality to seamlessly go back to dataframe (like in the example of .filter that you directly get the DataFrame)

Thanks

Example:

   A  B
1 -1  1
2 -1  2
3  0  2
4  1  1
5  1  2

df.groupby('A'):

GroupBy object
-1 : [1, 2]
 0 : [3]
 1 : [4,5]

GroupBy.filter(lambda x: x.name >= 0):

GroupBy object
 0 : [3]
 1 : [4,5]
like image 878
Péťa Poliak Avatar asked Apr 14 '18 13:04

Péťa Poliak


2 Answers

I think the previous answers propose workarounds, which are maybe useful in your case but doesn't answer the question.

You created groups, and you want to throw out or keep some groups based on group statistics THEN perform some group statistics you actually care for on the groups. This should be possible, and useful in many cases, however, it is not possible now as a chained command (as far as I know) only if you use two identical groupbys consequently.

Let's make a case: Groupby reveals some features that are not filterable on an item level basis (so previous filtering is not an option). For example a group sum. The annoyment in filter is, that it returns a dataframe rather than keeping the grouping and allow you to perform further computations on the groups.

Here is an example:

Let's say you want to group by 'C' and filter on the sums of 'A' in the groups (<700), but in the filtered groups you actually care for the std of the groups. If filter would just be a filter on groups, this would work:

df.groupby(['C']).filter(lambda x:x['A'].sum()<700, combine=False).std()

this doesn't work (note the nonexistent combine=False option on filter), what does is this:

df.groupby(['C']).filter(lambda x:x['A'].sum()<700).groupby(['C']).std()

What filter does is actually filter&combine, which follows the split-apply-combine logic.

like image 52
petsol Avatar answered Sep 27 '22 21:09

petsol


Using groupby doesn't actually aggregate the values in any way. It just creates the groupings, so the filter is essentially filtering on the original dataframe. I don't think you're saving time or computation by grouping first unless name is value achieved by applying a function to the groups.

Therefore I'd recommend something like

df.where(df.name > 0).groupby('A')  # now apply some transformation to the groups
like image 21
David Stevens Avatar answered Sep 27 '22 21:09

David Stevens