Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Keep a column with a categorical variable in Pandas with groupby and mean()

Is there a way to keep the categorical variable after groupby and mean()? For example, given the dataframe df:

              ratio    Metadata_A      Metadata_B   treatment
0      54265.937500           B10               1  AB_cmpd_01
11    107364.750000           B10               2  AB_cmpd_01
22     95766.500000           B10               3  AB_cmpd_01
24     64346.250000           B10               4  AB_cmpd_01
25     52726.333333           B10               5  AB_cmpd_01
30     65056.600000           B11               1          UT
41     78409.600000           B11               2          UT
52    133533.000000           B11               3          UT
54    102433.571429           B11               4          UT
55     82217.588235           B11               5          UT
60     89843.600000            B2               1          UT
71     98544.000000            B2               2          UT
82    179330.000000            B2               3          UT
84    107132.400000            B2               4          UT
85     73096.909091            B2               5          UT

I need to average over ratio within each of Metadata_A, but at the end to keep the column treatment:

Theoretically, something like:

df.groupby(by='Metadata_A').mean().reset_index()

              ratio    Metadata_A      Metadata_B   treatment
 0     54265.937500           B10             2.5  AB_cmpd_01
 1     78409.600000           B11             2.5          UT
 2    107132.400000            B2             2.5          UT

However, the column treatment disappears after the averaging.

like image 775
Arnold Klein Avatar asked Apr 26 '18 22:04

Arnold Klein


Video Answer


2 Answers

You can using groupby with agg

df.groupby(['Metadata_A','treatment'],as_index=False).agg({'Metadata_B':'mean','ratio':'first'})
Out[358]: 
  Metadata_A   treatment  Metadata_B       ratio
0        B10  AB_cmpd_01           3  54265.9375
1        B11          UT           3  65056.6000
2         B2          UT           3  89843.6000
like image 71
BENY Avatar answered Sep 19 '22 19:09

BENY


The issue is that pandas doesn't know how to take the mean of treatment, as these are strings. One solution would be to get your means using groupby('Metadata_A'), then merge those values with the original dataframe, and then groupby('Metadata_A') again:

# Get your means:
grp = df.groupby('Metadata_A').mean().reset_index()

# Merge those with the original `dataframe`, getting rid of extra columns
(df.merge(grp, on = ['Metadata_A'], suffixes=('', '_mean'))
 .drop(['Metadata_B', 'ratio'], axis=1)
 .groupby('Metadata_A')
 .first()
 .reset_index()
)

Which returns:

  Metadata_A   treatment     ratio_mean  Metadata_B_mean
0        B10  AB_cmpd_01   74893.954167                3
1        B11          UT   92330.071933                3
2         B2          UT  109589.381818                3

Edit @Wen's method of grouping by treatment and Metadata_A makes a lot more sense than what I just described. If you're looking for the means of both columns, you can just do:

df.groupby(['Metadata_A', 'treatment']).mean().reset_index()
like image 44
sacuL Avatar answered Sep 20 '22 19:09

sacuL