I am trying to perform a groupby filter that is very similar to the example in this documentation: pandas groupby filter
>>> df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
... 'foo', 'bar'],
... 'B' : [1, 2, 3, 4, 5, 6],
... 'C' : [2.0, 5., 8., 1., 2., 9.]})
>>> grouped = df.groupby('A')
>>> grouped.filter(lambda x: x['B'].mean() > 3.)
A B C
1 bar 2 5.0
3 bar 4 1.0
5 bar 6 9.0
I am trying to return a DataFrame that has all 3 columns, but only 2 rows. Those 2 rows contain the minimum values of column B, after grouping by column A. I tried the following line of code:
grouped.filter(lambda x: x['B'] == x['B'].min())
But this doesn't work, and I get this error:
TypeError: filter function returned a Series, but expected a scalar bool
The DataFrame I am trying to return should look like this:
A B C
0 foo 1 2.0
1 bar 2 5.0
I would appreciate any help you can provide. Thank you, in advance, for your help.
groupby() can accept several different arguments: a column or list of columns; a dict or pandas Series; a NumPy array or pandas Index, or an array-like iterable of these.
If you want to get a single value for each group, use aggregate() (or one of its shortcuts). If you want to get a subset of the original rows, use filter(). And if you want to get a new value for each original row, use transform().
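As a rough illustration of the three cases, here is a minimal sketch using the DataFrame from the question. The variable names (per_group_min, big_groups, row_min) are only illustrative and do not come from the original post.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
                   'B': [1, 2, 3, 4, 5, 6],
                   'C': [2.0, 5., 8., 1., 2., 9.]})
grouped = df.groupby('A')

# aggregate: one value per group (here, the minimum of B in each group)
per_group_min = grouped['B'].agg('min')

# filter: keeps or drops whole groups, based on one True/False per group
big_groups = grouped.filter(lambda g: g['B'].mean() > 3.)

# transform: one value per original row (the group's minimum, repeated)
row_min = grouped['B'].transform('min')

print(per_group_min)
print(big_groups)
print(row_min)
```

Note that none of the three, on its own, selects individual rows inside a group, which is what the question asks for.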
The “group by” process is split-apply-combine: (1) splitting the data into groups, (2) applying a function to each group independently, (3) combining the results into a data structure.
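The three steps can be spelled out by hand, which may make it clearer what groupby does internally. This is only a sketch under the assumption that the per-group function selects the rows with the minimal B; the names pieces, selected and combined are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
                   'B': [1, 2, 3, 4, 5, 6],
                   'C': [2.0, 5., 8., 1., 2., 9.]})

# (1) split: one sub-DataFrame per distinct value of column A
pieces = {key: group for key, group in df.groupby('A')}

# (2) apply: run the same function on each piece independently
selected = [g[g['B'] == g['B'].min()] for g in pieces.values()]

# (3) combine: glue the per-group results back into one DataFrame
combined = pd.concat(selected).sort_index()
print(combined)
```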
The short answer:
grouped.apply(lambda x: x[x['B'] == x['B'].min()])
... and the longer one:
Your grouped object has 2 groups:
In[25]: for df in grouped:
...: print(df)
...:
('bar',
A B C
1 bar 2 5.0
3 bar 4 1.0
5 bar 6 9.0)
('foo',
A B C
0 foo 1 2.0
2 foo 3 8.0
4 foo 5 2.0)
The filter() method of a GroupBy object filters groups as whole entities, NOT their individual rows. So with the filter() method you can obtain only 4 results: an empty DataFrame (0 rows), the rows of group 'bar' (3 rows), the rows of group 'foo' (3 rows), or the rows of both groups (6 rows). Nothing else, regardless of the parameter (Boolean function) passed to the filter() method.
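To see that filter() only ever keeps or drops whole groups, here is a small sketch that produces each of the possible outcomes on the question's data. The predicates are chosen purely for illustration; the point is that the function must return a single True or False per group.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
                   'B': [1, 2, 3, 4, 5, 6],
                   'C': [2.0, 5., 8., 1., 2., 9.]})
grouped = df.groupby('A')

# Each predicate returns ONE bool per group, so a group is kept or dropped whole.
none_kept = grouped.filter(lambda g: False)                # no group passes
bar_only = grouped.filter(lambda g: g['B'].mean() > 3.)    # mean(B) of bar is 4
foo_only = grouped.filter(lambda g: g['B'].mean() <= 3.)   # mean(B) of foo is 3
all_kept = grouped.filter(lambda g: True)                  # every group passes

for name, result in [('none', none_kept), ('bar', bar_only),
                     ('foo', foo_only), ('all', all_kept)]:
    print(name, len(result))
```

Returning a Series such as x['B'] == x['B'].min() gives one bool per row instead of one per group, which is why the attempt in the question raises the TypeError.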
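So you have to use some other method. An appropriate one is the very flexible apply() method, which lets you apply an arbitrary function that takes a DataFrame (one group of the GroupBy object) as its argument and returns a pandas object or a scalar.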
In your case that function should return, for each of your 2 groups, the 1-row DataFrame holding the minimal value in column 'B', so we will use the Boolean mask group['B'] == group['B'].min() to select such a row (or possibly several rows, if the minimum is tied):
In[26]: def select_min_b(group):
...: return group[group['B'] == group['B'].min()]
Now, passing this function to the apply() method of the GroupBy object grouped, we obtain
In[27]: grouped.apply(select_min_b)
Out[27]:
A B C
A
bar 1 bar 2 5.0
foo 0 foo 1 2.0
Note:
The same as a single command (using a lambda function):
grouped.apply(lambda group: group[group['B'] == group['B'].min()])
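As an alternative to apply() (this is a different technique, not the one used above), the same rows can be selected with a Boolean mask built from transform('min'). The sketch below assumes the DataFrame from the question. It keeps the original index and all three columns, so the result has the shape the question asks for; it may also be less sensitive to how a given pandas version treats the grouping column inside apply().

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
                   'B': [1, 2, 3, 4, 5, 6],
                   'C': [2.0, 5., 8., 1., 2., 9.]})

# For every row, the minimum of B within that row's group (same length as df).
group_min = df.groupby('A')['B'].transform('min')

# Keep the rows whose B equals their group's minimum; ties would all be kept.
result = df[df['B'] == group_min]
print(result)
```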