Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas: Get highest value from a column for each unique value in another column

Tags:

python

pandas

How to get the highest value in one column for each unique value in another column and return the same dataframe structure back. Here is a pandas dataframe example?

       reg.nr   counter      value     ID2  categ             date
1       37367       421     231385      93      A       20.01.2004
2       37368       428     235156      93      B       21.01.2004
3       37369       408     234251      93      C       22.01.2004
4       37372       403     196292      93      D       23.01.2004
5       55523       400     247141      139     E       24.01.2004
6       55575       415     215818      139     F       25.01.2004
7       55576       402     204404      139     A       26.01.2004
8       69940       402     62244       175     B       27.01.2004
9       69941       402     38274       175     C       28.01.2004
10      69942       404     55171       175     D       29.01.2004
11      69943       416     55495       175     E       30.01.2004
12      69944       407     90231       175     F       31.01.2004
13      69945       411     75382       175     A       01.02.2004
14      69948       405     119129      175     B       02.02.2004

Where i want to return the highest value of column "counter" based on the unique value of column "ID2". After the new pandas dataframe should look like this:

       reg.nr   counter      value     ID2  categ             date
1       37368       428     235156      93      B       21.01.2004
2       55575       415     215818      139     F       25.01.2004
3       69943       416     55495       175     E       30.01.2004
like image 321
RaduS Avatar asked Feb 05 '17 09:02

RaduS


2 Answers

One way using drop_duplicates

In [332]: df.sort_values('counter', ascending=False).drop_duplicates(['ID2'])
Out[332]:
    reg.nr  counter   value  ID2 categ        date
2    37368      428  235156   93     B  21.01.2004
11   69943      416   55495  175     E  30.01.2004
6    55575      415  215818  139     F  25.01.2004

For desired output, you could sort on two columns, and reset the index

In [336]: (df.sort_values(['ID2', 'counter'], ascending=[True, False])
             .drop_duplicates(['ID2']).reset_index(drop=True)
          )
Out[336]:
   reg.nr  counter   value  ID2 categ        date
0   37368      428  235156   93     B  21.01.2004
1   55575      415  215818  139     F  25.01.2004
2   69943      416   55495  175     E  30.01.2004
like image 125
Zero Avatar answered Sep 20 '22 10:09

Zero


df.loc[df.groupby('ID2')['counter'].idxmax(), :].reset_index()

   index  reg.nr  counter   value  ID2 categ        date
0      2   37368      428  235156   93     B  21.01.2004
1      6   55575      415  215818  139     F  25.01.2004
2     11   69943      416   55495  175     E  30.01.2004

First, you are grouping your dataframe by column ID2. Then you get counter column and calculate an index of (first) maximal element of this column in each group. Then you use these indexes to filter your initial dataframe. Finally you resets indexes (if you need it).

like image 22
Ilya V. Schurov Avatar answered Sep 20 '22 10:09

Ilya V. Schurov