
Chaining grouping, filtration and aggregation

The DataFrameGroupBy.filter method filters the groups and returns a DataFrame containing the rows that passed the filter.

But what can I do to obtain a new DataFrameGroupBy object instead of a DataFrame after filtration?

For example, let's say I have a DataFrame df with two columns A and B. I want to obtain the average value of column B for each value of column A, as long as there are at least 5 rows in that group:

# pandas 0.18.0
# doesn't work because `filter` returns a DF not a GroupBy object
df.groupby('A').filter(lambda x: len(x)>=5).mean()
# works, but slower and awkward to write because it needs to groupby('A') twice
df.groupby('A').filter(lambda x: len(x)>=5).reset_index().groupby('A').mean()
# works but more verbose than chaining
groups = df.groupby('A')
groups.mean()[groups.size() >= 5]
asked Apr 03 '16 by max


People also ask

What is aggregation and grouping?

Data aggregation and grouping allow us to create summaries for display or analysis, for example when calculating average values or building a table of counts or sums. The process follows the split-apply-combine strategy: split the data into groups based on some criteria, apply a function to each group, and combine the results.
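For example, a minimal split-apply-combine in pandas, using the question's columns A and B:

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2, 2], 'B': [10, 20, 3, 4, 5]})

# split on A, apply mean to each group's B, combine into one Series
df.groupby('A')['B'].mean()
# A
# 1    15.0
# 2     4.0
# Name: B, dtype: float64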

Can I use group by without aggregate function pandas?

Yes. Instead of always pairing groupby with an aggregation, you can group without aggregating, for example to process each group separately.
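One way is transform, which broadcasts a per-group result back onto the original rows instead of collapsing them (a minimal sketch):

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2], 'B': [10, 20, 3]})

# no aggregation: every row keeps its identity but gains its group's mean
df['B_mean'] = df.groupby('A')['B'].transform('mean')
# the A=1 rows get 15.0, the A=2 row gets 3.0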

What does the DataFrame Groupby operation perform?

The groupby() function is used to split the data into groups based on some criteria; pandas objects can be split on any of their axes. Abstractly, grouping provides a mapping of labels to group names.
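Concretely, the resulting GroupBy object exposes that mapping and is iterable (a minimal sketch):

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2], 'B': [10, 20, 3]})

gb = df.groupby('A')
gb.groups                  # {1: [0, 1], 2: [2]} -- group key -> row labels
for key, group in gb:      # each group is a sub-DataFrame
    print(key, len(group))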


2 Answers

You can do it this way:

In [310]: df
Out[310]:
    a  b
0   1  4
1   7  3
2   6  9
3   4  4
4   0  2
5   8  4
6   7  7
7   0  5
8   8  5
9   8  7
10  6  1
11  3  8
12  7  4
13  8  0
14  5  3
15  5  3
16  8  1
17  7  2
18  9  9
19  3  2
20  9  1
21  1  2
22  0  3
23  8  9
24  7  7
25  8  1
26  5  8
27  9  6
28  2  8
29  9  0

In [314]: r = df.groupby('a').apply(lambda x: x.b.mean() if len(x)>=5 else -1)

In [315]: r
Out[315]:
a
0   -1.000000
1   -1.000000
2   -1.000000
3   -1.000000
4   -1.000000
5   -1.000000
6   -1.000000
7    4.600000
8    3.857143
9   -1.000000
dtype: float64

In [316]: r[r>0]
Out[316]:
a
7    4.600000
8    3.857143
dtype: float64
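Note that filtering on r > 0 relies on the real group means being positive; a NaN sentinel avoids that assumption (a sketch, not part of the original timings):

import numpy as np

r = df.groupby('a').apply(lambda x: x.b.mean() if len(x) >= 5 else np.nan)
r.dropna()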

A one-liner which returns a DataFrame instead of a Series:

df.groupby('a') \
  .apply(lambda x: x.b.mean() if len(x)>=5 else -1) \
  .to_frame() \
  .rename(columns={0:'mean'}) \
  .query('mean > 0')
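A slightly terser variant should be equivalent, since to_frame accepts the column name directly (an assumption, not timed below):

df.groupby('a') \
  .apply(lambda x: x.b.mean() if len(x) >= 5 else -1) \
  .to_frame('mean') \
  .query('mean > 0')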

Timeit comparison against a DF with 100,000 rows:
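The exact test frame isn't shown (the checksums below imply many distinct keys and large values); a hypothetical stand-in for re-running the comparison, not the original data:

import numpy as np
import pandas as pd

np.random.seed(0)
n = 100000
# hypothetical generator -- many keys, so the len >= 5 filter actually bites
df = pd.DataFrame({'a': np.random.randint(0, 20000, n),
                   'b': np.random.randint(0, 100000, n)})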

def maxu():
    r = df.groupby('a').apply(lambda x: x.b.mean() if len(x)>=5 else -1)
    return r[r>0]

def maxu2():
    return df.groupby('a') \
             .apply(lambda x: x.b.mean() if len(x)>=5 else -1) \
             .to_frame() \
             .rename(columns={0:'mean'}) \
             .query('mean > 0')

def alexander():
    return df.groupby('a', as_index=False).filter(lambda group: group.a.count() >= 5).groupby('a').mean()

def alexander2():
    vc = df.a.value_counts()
    return df.loc[df.a.isin(vc[vc >= 5].index)].groupby('a').mean()

Results:

In [419]: %timeit maxu()
1 loop, best of 3: 1.12 s per loop

In [420]: %timeit maxu2()
1 loop, best of 3: 1.12 s per loop

In [421]: %timeit alexander()
1 loop, best of 3: 34.9 s per loop

In [422]: %timeit alexander2()
10 loops, best of 3: 66.6 ms per loop

Check:

In [423]: alexander2().sum()
Out[423]:
b   19220943.162
dtype: float64

In [424]: maxu2().sum()
Out[424]:
mean   19220943.162
dtype: float64

Conclusion:

The clear winner is the alexander2() function.

@Alexander, congratulations!

answered Sep 28 '22 by MaxU - stop WAR against UA


Here is some reproducible data:

np.random.seed(0)

df = pd.DataFrame(np.random.randint(0, 10, (10, 2)), columns=list('AB'))

>>> df
   A  B
0  5  0
1  3  3
2  7  9
3  3  5
4  2  4
5  7  6
6  8  8
7  1  6
8  7  7
9  8  1

A sample filter application, demonstrating that it works on the data:

gb = df.groupby('A')
>>> gb.filter(lambda group: group.A.count() >= 3)
   A  B
2  7  9
5  7  6
8  7  7

Here are some of your options:

1) You can first filter based on the value counts, and then group.

vc = df.A.value_counts()

>>> df.loc[df.A.isin(vc[vc >= 2].index)].groupby('A').mean()
          B
A          
3  4.000000
7  7.333333
8  4.500000

2) Perform groupby twice, before and after the filter:

>>> (df.groupby('A', as_index=False)
       .filter(lambda group: group.A.count() >= 2)
       .groupby('A')
       .mean())
          B
A          
3  4.000000
7  7.333333
8  4.500000

3) Given that your first groupby returns the groups, you can also filter on those:

d = {k: v 
     for k, v in df.groupby('A').groups.items() 
     if len(v) >= 2}  # gb.groups.iteritems() for Python 2

>>> d
{3: [1, 3], 7: [2, 5, 8], 8: [6, 9]}

This is a bit of a hack, but should be relatively efficient as you don't need to regroup.

>>> pd.DataFrame({col: [df.loc[d[col], 'B'].mean()] for col in d}).T.rename(columns={0: 'B'})
          B
3  4.000000
7  7.333333
8  4.500000

Timings with 100k rows

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 10, (100000, 2)), columns=list('AB'))

%timeit df.groupby('A', as_index=False).filter(lambda group: group['A'].count() >= 5).groupby('A').mean()
100 loops, best of 3: 18 ms per loop

%%timeit
vc = df.A.value_counts()
df.loc[df.A.isin(vc[vc >= 5].index)].groupby('A').mean()
100 loops, best of 3: 15.7 ms per loop
answered Sep 28 '22 by Alexander