Python 3 pandas.groupby.filter

I am trying to perform a groupby filter that is very similar to the example in this documentation: pandas groupby filter

>>> df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
...                           'foo', 'bar'],
...                    'B' : [1, 2, 3, 4, 5, 6],
...                    'C' : [2.0, 5., 8., 1., 2., 9.]})
>>> grouped = df.groupby('A')
>>> grouped.filter(lambda x: x['B'].mean() > 3.)
     A  B    C
1  bar  2  5.0
3  bar  4  1.0
5  bar  6  9.0

I am trying to return a DataFrame that has all 3 columns, but only 2 rows: for each group in column A, the row that holds the minimum value of column B. I tried the following line of code:

grouped.filter(lambda x: x['B'] == x['B'].min())

But this doesn't work, and I get this error: TypeError: filter function returned a Series, but expected a scalar bool

The DataFrame I am trying to return should look like this:

     A  B    C
0  foo  1  2.0
1  bar  2  5.0

I would appreciate any help you can provide. Thank you, in advance, for your help.

FinProg asked Feb 15 '19 21:02


1 Answer

The short answer:

grouped.apply(lambda x: x[x['B'] == x['B'].min()])

... and the longer one:

Your grouped object has 2 groups:

In[25]: for item in grouped:   # each item is a (name, sub-DataFrame) tuple
   ...:     print(item)
   ...:     
('bar',      
     A  B    C
1  bar  2  5.0
3  bar  4  1.0
5  bar  6  9.0)

('foo',      
     A  B    C
0  foo  1  2.0
2  foo  3  8.0
4  foo  5  2.0)

The filter() method of a GroupBy object filters whole groups, NOT their individual rows: the function you pass must return a single True or False per group, which is why returning a Series raises the TypeError. So with filter() you can obtain only 4 possible results:

  • an empty DataFrame (0 rows),
  • rows of the group 'bar' (3 rows),
  • rows of the group 'foo' (3 rows),
  • rows of both groups (6 rows)

Nothing else, whatever boolean function you pass to filter().
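
To make this concrete, here is a small sketch that produces each of the four outcomes on the DataFrame from the question. The predicates are arbitrary ones picked for illustration, not anything the asker needs:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
                   'B': [1, 2, 3, 4, 5, 6],
                   'C': [2.0, 5., 8., 1., 2., 9.]})
grouped = df.groupby('A')

# Each callable returns ONE bool per group; the group is kept or dropped whole.
none_kept = grouped.filter(lambda g: False)              # no group passes
bar_only = grouped.filter(lambda g: g['B'].mean() > 3.)  # only 'bar' passes
foo_only = grouped.filter(lambda g: g['B'].min() == 1)   # only 'foo' passes
both_kept = grouped.filter(lambda g: True)               # both groups pass

row_counts = [len(none_kept), len(bar_only), len(foo_only), len(both_kept)]
print(row_counts)  # [0, 3, 3, 6]
```

No predicate can give a row count of 2 here, because rows are never selected individually.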


So you have to use some other method. An appropriate one is the very flexible apply() method, which lets you apply an arbitrary function that

  • takes a DataFrame (one group of the GroupBy object) as its only parameter,
  • returns either a pandas object or a scalar.

In your case that function should return, for each of your 2 groups, the 1-row DataFrame holding the minimal value in column 'B', so we will use the Boolean mask

group['B'] == group['B'].min()

to select that row (or several rows, if the minimum is tied):

In[26]: def select_min_b(group):
   ...:     return group[group['B'] == group['B'].min()]

Passing this function to the apply() method of the GroupBy object grouped gives

In[27]: grouped.apply(select_min_b)
Out[27]: 
         A  B    C
A                 
bar 1  bar  2  5.0
foo 0  foo  1  2.0
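
The result carries an extra index level holding the group key. If you would rather keep only the original row labels, one option (a sketch, not the only way) is to drop that level with reset_index. Be aware that newer pandas versions may warn about, or leave out, the grouping column 'A' inside apply(), so the columns you see can differ from the output above:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
                   'B': [1, 2, 3, 4, 5, 6],
                   'C': [2.0, 5., 8., 1., 2., 9.]})
grouped = df.groupby('A')

def select_min_b(group):
    # Keep the row(s) of this group where B equals the group's minimum.
    return group[group['B'] == group['B'].min()]

result = grouped.apply(select_min_b)

# Level 0 of the index is the group key; drop it to get back the
# original row labels (1 for 'bar', 0 for 'foo').
flat = result.reset_index(level=0, drop=True)
print(flat)
```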

Note:

The same as a single command, using a lambda function:

grouped.apply(lambda group: group[group['B'] == group['B'].min()])
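
If you want exactly the frame shown in the question (original index, original row order), a different technique from apply() also works: build the mask on the whole DataFrame with transform('min'), or look up the row labels with idxmin(). This is an alternative sketch, not part of the apply() approach above:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
                   'B': [1, 2, 3, 4, 5, 6],
                   'C': [2.0, 5., 8., 1., 2., 9.]})

# transform('min') returns a Series aligned with df that holds, for every
# row, the minimum of B within that row's group.
group_min = df.groupby('A')['B'].transform('min')
min_rows = df[df['B'] == group_min]   # keeps ties, keeps original order
print(min_rows)

# idxmin() returns one index label per group (the first minimum),
# ordered by group key, so 'bar' comes before 'foo'.
first_min_rows = df.loc[df.groupby('A')['B'].idxmin()]
print(first_min_rows)
```

The transform() variant keeps every tied row, while the idxmin() variant returns exactly one row per group.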
MarianD answered Oct 24 '22 09:10