Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas aggregate -- how to retain all columns

Example dataframe:

rand = np.random.RandomState(1)
df = pd.DataFrame({'A': ['group1', 'group2', 'group3'] * 2,
                'B': rand.rand(6),
                'C': rand.rand(6),
                'D': rand.rand(6)})

print df

        A         B         C         D
0  group1  0.417022  0.186260  0.204452
1  group2  0.720324  0.345561  0.878117
2  group3  0.000114  0.396767  0.027388
3  group1  0.302333  0.538817  0.670468
4  group2  0.146756  0.419195  0.417305
5  group3  0.092339  0.685220  0.558690

Groupby column A

group = df.groupby('A')

Use agg to return max value for each group

max1 = group['B'].agg({'max' : np.max})
print max1

             max
A               
group1  0.417022
group2  0.720324
group3  0.092339

But I would like to retain (or get back) the appropriate data in the other columns, C and D. This would be the remaining data for the row which contained the max value. So, the return should be:

     A         B         C         D
group1  0.417022  0.186260  0.204452
group2  0.720324  0.345561  0.878117
group3  0.092339  0.685220  0.558690

Can anybody show how to do this? Any help appreciated.

like image 256
rdh9 Avatar asked Aug 19 '14 13:08

rdh9


People also ask

Does pandas GroupBy preserve order?

Groupby preserves the order of rows within each group.

How do I keep a GroupBy column?

Grouping by a DataFrame and keeping columns involves four steps: get the data, split it, apply a function, and combine the result. It would be best to learn data grouping in Pandas before seeing practical examples.

How do you aggregate multiple columns in Python?

To apply aggregations to multiple columns, just add additional key:value pairs to the dictionary. Applying multiple aggregation functions to a single column will result in a multiindex. Working with multi-indexed columns is a pain and I'd recommend flattening this after aggregating by renaming the new columns.


2 Answers

Two stages: first find indices, then lookup all the rows.

idx = df.groupby('A').apply(lambda x: x['B'].argmax())
idx

Out[362]: 
A
group1    0
group2    1
group3    5

df.loc[idx]

Out[364]: 
        A         B         C         D
0  group1  0.417022  0.186260  0.204452
1  group2  0.720324  0.345561  0.878117
5  group3  0.092339  0.685220  0.558690
like image 107
FooBar Avatar answered Sep 29 '22 07:09

FooBar


My answer is similar to FooBar but is done in one line by using idmax()

df.loc[df.groupby('A')['B'].idxmax()]

Result is the same:

In [51]: df
Out[51]: 
        A         B         C         D
0  group1  0.417022  0.186260  0.204452
1  group2  0.720324  0.345561  0.878117
2  group3  0.000114  0.396767  0.027388
3  group1  0.302333  0.538817  0.670468
4  group2  0.146756  0.419195  0.417305
5  group3  0.092339  0.685220  0.558690

In [76]: df.loc[df.groupby('A')['B'].idxmax()]
Out[76]: 
        A         B         C         D
0  group1  0.417022  0.186260  0.204452
1  group2  0.720324  0.345561  0.878117
5  group3  0.092339  0.685220  0.558690
like image 41
phi-j Avatar answered Sep 29 '22 06:09

phi-j