The DataFrameGroupBy.filter method filters the groups and returns a DataFrame
containing the rows that passed the filter.
But how can I obtain a new DataFrameGroupBy
object, rather than a DataFrame,
after filtering?
For example, let's say I have a DataFrame
df
with two columns, A
and B
. I want to obtain the average value of column B
for each value of column A
, as long as there are at least 5 rows in that group:
# pandas 0.18.0
# doesn't work because `filter` returns a DataFrame, not a GroupBy object
df.groupby('A').filter(lambda x: len(x) >= 5).mean()
# works, but slower and awkward to write because it has to groupby('A') twice
df.groupby('A').filter(lambda x: len(x) >= 5).reset_index().groupby('A').mean()
# works but more verbose than chaining
groups = df.groupby('A')
groups.mean()[groups.size() >= 5]
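As a quick sanity check, that last option can be run end to end on a tiny invented DataFrame (the data and the size threshold of 3 here are purely illustrative):

```python
import pandas as pd

# tiny invented example: group 1 has three rows, group 2 only two
df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': [10, 20, 30, 40, 50]})

groups = df.groupby('A')
# the boolean Series from groups.size() is indexed by 'A', so it aligns
# with the index of groups.mean() and small groups simply drop out
result = groups.mean()[groups.size() >= 3]
print(result)  # keeps only A == 1, with mean B = 20.0
```

The alignment on the group keys is what makes this work without any explicit merge.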
You can do it this way:
In [310]: df
Out[310]:
    a  b
0   1  4
1   7  3
2   6  9
3   4  4
4   0  2
5   8  4
6   7  7
7   0  5
8   8  5
9   8  7
10  6  1
11  3  8
12  7  4
13  8  0
14  5  3
15  5  3
16  8  1
17  7  2
18  9  9
19  3  2
20  9  1
21  1  2
22  0  3
23  8  9
24  7  7
25  8  1
26  5  8
27  9  6
28  2  8
29  9  0
In [314]: r = df.groupby('a').apply(lambda x: x.b.mean() if len(x)>=5 else -1)
In [315]: r
Out[315]:
a
0   -1.000000
1   -1.000000
2   -1.000000
3   -1.000000
4   -1.000000
5   -1.000000
6   -1.000000
7    4.600000
8    3.857143
9   -1.000000
dtype: float64
In [316]: r[r>0]
Out[316]:
a
7    4.600000
8    3.857143
dtype: float64
A one-liner that returns a DataFrame instead of a Series:
df.groupby('a') \
  .apply(lambda x: x.b.mean() if len(x) >= 5 else -1) \
  .to_frame() \
  .rename(columns={0: 'mean'}) \
  .query('mean > 0')
timeit comparison against a DataFrame with 100,000 rows:
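The exact DataFrame used for these timings isn't shown in this answer; a plausible reconstruction (an assumption, mirroring the 100,000-row random data used later in this thread, with lowercase column names to match the functions below) would be:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
# assumed benchmark data: 100,000 rows, integer keys 0-9 in column 'a'
df = pd.DataFrame(np.random.randint(0, 10, (100000, 2)), columns=list('ab'))
```

With only ten distinct keys, every group has roughly 10,000 rows, so the `>= 5` filter keeps all of them and the benchmark measures overhead rather than filtering.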
def maxu():
    r = df.groupby('a').apply(lambda x: x.b.mean() if len(x) >= 5 else -1)
    return r[r > 0]

def maxu2():
    return df.groupby('a') \
             .apply(lambda x: x.b.mean() if len(x) >= 5 else -1) \
             .to_frame() \
             .rename(columns={0: 'mean'}) \
             .query('mean > 0')

def alexander():
    return df.groupby('a', as_index=False) \
             .filter(lambda group: group.a.count() >= 5) \
             .groupby('a') \
             .mean()

def alexander2():
    vc = df.a.value_counts()
    return df.loc[df.a.isin(vc[vc >= 5].index)].groupby('a').mean()
Results:
In [419]: %timeit maxu()
1 loop, best of 3: 1.12 s per loop
In [420]: %timeit maxu2()
1 loop, best of 3: 1.12 s per loop
In [421]: %timeit alexander()
1 loop, best of 3: 34.9 s per loop
In [422]: %timeit alexander2()
10 loops, best of 3: 66.6 ms per loop
Check:
In [423]: alexander2().sum()
Out[423]:
b 19220943.162
dtype: float64
In [424]: maxu2().sum()
Out[424]:
mean 19220943.162
dtype: float64
Conclusion: the clear winner is the alexander2() function. @Alexander, congratulations!
Here is some reproducible data:
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 10, (10, 2)), columns=list('AB'))
>>> df
   A  B
0  5  0
1  3  3
2  7  9
3  3  5
4  2  4
5  7  6
6  8  8
7  1  6
8  7  7
9  8  1
A sample filter application demonstrating that it works on the data.
gb = df.groupby('A')
>>> gb.filter(lambda group: group.A.count() >= 3)
   A  B
2  7  9
5  7  6
8  7  7
Here are some of your options:
1) First filter based on the value counts, then group.
vc = df.A.value_counts()
>>> df.loc[df.A.isin(vc[vc >= 2].index)].groupby('A').mean()
          B
A
3  4.000000
7  7.333333
8  4.500000
2) Perform groupby twice, before and after the filter:
>>> (df.groupby('A', as_index=False)
        .filter(lambda group: group.A.count() >= 2)
        .groupby('A')
        .mean())
          B
A
3  4.000000
7  7.333333
8  4.500000
3) Given that your first groupby returns the groups, you can also filter on those:
d = {k: v
     for k, v in df.groupby('A').groups.items()
     if len(v) >= 2}  # gb.groups.iteritems() for Python 2
>>> d
{3: [1, 3], 7: [2, 5, 8], 8: [6, 9]}
This is a bit of a hack, but should be relatively efficient as you don't need to regroup.
>>> pd.DataFrame({col: [df.loc[d[col], 'B'].mean()] for col in d}).T.rename(columns={0: 'B'})
          B
3  4.000000
7  7.333333
8  4.500000
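A slightly less hacky way to consume the same filtered `d` dict (a sketch using a plain `pd.Series` comprehension instead of the transpose-and-rename trick):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 10, (10, 2)), columns=list('AB'))

# same filtered groups dict as above: keep only groups with >= 2 rows
d = {k: v for k, v in df.groupby('A').groups.items() if len(v) >= 2}

# build the per-group means directly as a Series named 'B'
result = pd.Series({k: df.loc[idx, 'B'].mean() for k, idx in d.items()}, name='B')
print(result)  # 3 -> 4.0, 7 -> 7.333..., 8 -> 4.5
```

This still avoids regrouping, but skips the intermediate one-row DataFrame and the column rename.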
Timings with 100k rows
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 10, (100000, 2)), columns=list('AB'))
%timeit df.groupby('A', as_index=False).filter(lambda group: group['A'].count() >= 5).groupby('A').mean()
100 loops, best of 3: 18 ms per loop
%%timeit
vc = df.A.value_counts()
df.loc[df.A.isin(vc[vc >= 5].index)].groupby('A').mean()
100 loops, best of 3: 15.7 ms per loop