The DataFrameGroupBy.filter method filters the groups and returns a DataFrame
containing the rows that passed the filter.
But how can I obtain a new DataFrameGroupBy
object, rather than a DataFrame,
after filtering?
For example, let's say I have a DataFrame
df
with two columns, A
and B
. I want to obtain the average value of column B
for each value of column A
, as long as there are at least 5 rows in that group:
# pandas 0.18.0
# doesn't work because `filter` returns a DataFrame, not a GroupBy object
df.groupby('A').filter(lambda x: len(x) >= 5).mean()
# works, but slower and awkward to write because it has to groupby('A') twice
df.groupby('A').filter(lambda x: len(x) >= 5).reset_index().groupby('A').mean()
# works but more verbose than chaining
groups = df.groupby('A')
groups.mean()[groups.size() >= 5]
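As a quick sanity check, that last option can be run end to end on a tiny invented DataFrame (the data and the size threshold of 3 here are purely illustrative):

```python
import pandas as pd

# tiny invented example: group 1 has three rows, group 2 only two
df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': [10, 20, 30, 40, 50]})

groups = df.groupby('A')
# the boolean Series from groups.size() is indexed by 'A', so it aligns
# with the index of groups.mean() and small groups simply drop out
result = groups.mean()[groups.size() >= 3]
print(result)  # keeps only A == 1, with mean B = 20.0
```

The alignment on the group keys is what makes this work without any explicit merge.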
You can do it this way:
In [310]: df
Out[310]:
    a  b
0   1  4
1   7  3
2   6  9
3   4  4
4   0  2
5   8  4
6   7  7
7   0  5
8   8  5
9   8  7
10  6  1
11  3  8
12  7  4
13  8  0
14  5  3
15  5  3
16  8  1
17  7  2
18  9  9
19  3  2
20  9  1
21  1  2
22  0  3
23  8  9
24  7  7
25  8  1
26  5  8
27  9  6
28  2  8
29  9  0
In [314]: r = df.groupby('a').apply(lambda x: x.b.mean() if len(x)>=5 else -1)
In [315]: r
Out[315]:
a
0   -1.000000
1   -1.000000
2   -1.000000
3   -1.000000
4   -1.000000
5   -1.000000
6   -1.000000
7    4.600000
8    3.857143
9   -1.000000
dtype: float64
In [316]: r[r>0]
Out[316]:
a
7    4.600000
8    3.857143
dtype: float64
A one-liner that returns a DataFrame instead of a Series:
df.groupby('a') \
  .apply(lambda x: x.b.mean() if len(x) >= 5 else -1) \
  .to_frame() \
  .rename(columns={0: 'mean'}) \
  .query('mean > 0')
timeit comparison against a DataFrame with 100,000 rows:
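The exact DataFrame used for these timings isn't shown in this answer; a plausible reconstruction (an assumption, mirroring the 100,000-row random data used later in this thread, with lowercase column names to match the functions below) would be:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
# assumed benchmark data: 100,000 rows, integer keys 0-9 in column 'a'
df = pd.DataFrame(np.random.randint(0, 10, (100000, 2)), columns=list('ab'))
```

With only ten distinct keys, every group has roughly 10,000 rows, so the `>= 5` filter keeps all of them and the benchmark measures overhead rather than filtering.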
def maxu():
    r = df.groupby('a').apply(lambda x: x.b.mean() if len(x) >= 5 else -1)
    return r[r > 0]

def maxu2():
    return df.groupby('a') \
             .apply(lambda x: x.b.mean() if len(x) >= 5 else -1) \
             .to_frame() \
             .rename(columns={0: 'mean'}) \
             .query('mean > 0')

def alexander():
    return df.groupby('a', as_index=False) \
             .filter(lambda group: group.a.count() >= 5) \
             .groupby('a') \
             .mean()

def alexander2():
    vc = df.a.value_counts()
    return df.loc[df.a.isin(vc[vc >= 5].index)].groupby('a').mean()
Results:
In [419]: %timeit maxu()
1 loop, best of 3: 1.12 s per loop
In [420]: %timeit maxu2()
1 loop, best of 3: 1.12 s per loop
In [421]: %timeit alexander()
1 loop, best of 3: 34.9 s per loop
In [422]: %timeit alexander2()
10 loops, best of 3: 66.6 ms per loop
Check:
In [423]: alexander2().sum()
Out[423]:
b 19220943.162
dtype: float64
In [424]: maxu2().sum()
Out[424]:
mean 19220943.162
dtype: float64
Conclusion: the clear winner is the alexander2() function. @Alexander, congratulations!
Here is some reproducible data:
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 10, (10, 2)), columns=list('AB'))
>>> df
   A  B
0  5  0
1  3  3
2  7  9
3  3  5
4  2  4
5  7  6
6  8  8
7  1  6
8  7  7
9  8  1
A sample filter application demonstrating that it works on the data.
gb = df.groupby('A')
>>> gb.filter(lambda group: group.A.count() >= 3)
   A  B
2  7  9
5  7  6
8  7  7
Here are some of your options:
1) First filter based on the value counts, then group.
vc = df.A.value_counts()
>>> df.loc[df.A.isin(vc[vc >= 2].index)].groupby('A').mean()
          B
A
3  4.000000
7  7.333333
8  4.500000
2) Perform groupby twice, before and after the filter:
>>> (df.groupby('A', as_index=False)
        .filter(lambda group: group.A.count() >= 2)
        .groupby('A')
        .mean())
          B
A
3  4.000000
7  7.333333
8  4.500000
3) Given that your first groupby returns the groups, you can also filter on those:
d = {k: v
     for k, v in df.groupby('A').groups.items()
     if len(v) >= 2}  # gb.groups.iteritems() for Python 2
>>> d
{3: [1, 3], 7: [2, 5, 8], 8: [6, 9]}
This is a bit of a hack, but should be relatively efficient as you don't need to regroup.
>>> pd.DataFrame({col: [df.loc[d[col], 'B'].mean()] for col in d}).T.rename(columns={0: 'B'})
          B
3  4.000000
7  7.333333
8  4.500000
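A slightly less hacky way to consume the same filtered `d` dict (a sketch using a plain `pd.Series` comprehension instead of the transpose-and-rename trick):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 10, (10, 2)), columns=list('AB'))

# same filtered groups dict as above: keep only groups with >= 2 rows
d = {k: v for k, v in df.groupby('A').groups.items() if len(v) >= 2}

# build the per-group means directly as a Series named 'B'
result = pd.Series({k: df.loc[idx, 'B'].mean() for k, idx in d.items()}, name='B')
print(result)  # 3 -> 4.0, 7 -> 7.333..., 8 -> 4.5
```

This still avoids regrouping, but skips the intermediate one-row DataFrame and the column rename.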
Timings with 100k rows
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 10, (100000, 2)), columns=list('AB'))
%timeit df.groupby('A', as_index=False).filter(lambda group: group['A'].count() >= 5).groupby('A').mean()
100 loops, best of 3: 18 ms per loop
%%timeit
vc = df.A.value_counts()
df.loc[df.A.isin(vc[vc >= 5].index)].groupby('A').mean()
100 loops, best of 3: 15.7 ms per loop