 

How to use split-apply-combine pattern of pandas groupby() to normalize multiple columns simultaneously

I am trying to normalize experimental data in a pandas data table that contains multiple columns with numerical observables (features), columns with the date and experimental conditions, as well as additional non-numerical information such as filenames.

I would like to

  • use the split-apply-combine paradigm
  • normalize within groups, using aggregate statistics of subgroups
  • use different normalizations (e.g. divide-by-control-mean, Z-score)
  • apply this to all numerical columns (observables)
  • finally, generate an augmented data table which has the same structure as the original, but with additional columns, e.g. for column Observable1 a column normalized_Observable1 should be added

A simplified data table that has this structure can be generated with this code snippet:

import numpy as np
import pandas as pd
df = pd.DataFrame({
   'condition': ['ctrl', 'abc', 'ctrl', 'abc', 'def', 'ctlr', 'ctlr', 'asdasd', 'afff', 'afff', 'gr1','gr2', 'gr2', 'ctrl', 'ctrl', 'kjkj','asht','ctrl'],
   'date':  ['20170131', '20170131', '20170131', '20170131','20170131', '20170606', '20170606', '20170606', '20170606', '20170606', '20170404', '20170404', '20170404', '20170404', '20170404', '20161212', '20161212', '20161212'],
   'observation1':  [1.2, 2.2, 1.3, 1.1, 2.3 , 2.3, 4.2, 3.3, 5.1, 3.3, 3.4, 5.5, 9.9, 3.2, 1.1, 3.3, 1.2, 5.4],
   'observation2':  [3.1, 2.2, 2.1, 1.2,  2.4, 1.2, 1.5, 1.33, 1.5, 1.6, 1.4, 1.3, 0.9, 0.78, 1.2, 4.0, 5.0, 6.0],
   'observation3':  [2.0, 1.2, 1.2, 2.01, 2.55, 2.05, 1.66, 3.2, 3.21, 3.04, 8.01, 9.1, 7.06, 8.1, 7.9, 5.12, 5.23, 5.15],
   'rawsource': ["1.tif", "2.tif", "3.tif",  "4.tif", "5.tif","6.tif", "7.tif", "8.tif", "9.tif", "10.tif", "11.tif", "12.tif", "13.tif", "14.tif", "15.tif", "16.tif", "17.tif", "18.tif"]
})
print(df)

and would look like this:

   condition      date  observation1  observation2  observation3 rawsource
0       ctrl  20170131           1.2          3.10          2.00     1.tif
1        abc  20170131           2.2          2.20          1.20     2.tif
2       ctrl  20170131           1.3          2.10          1.20     3.tif
3        abc  20170131           1.1          1.20          2.01     4.tif
4        def  20170131           2.3          2.40          2.55     5.tif
5       ctlr  20170606           2.3          1.20          2.05     6.tif
6       ctlr  20170606           4.2          1.50          1.66     7.tif
7     asdasd  20170606           3.3          1.33          3.20     8.tif
8       afff  20170606           5.1          1.50          3.21     9.tif
9       afff  20170606           3.3          1.60          3.04    10.tif
10       gr1  20170404           3.4          1.40          8.01    11.tif
11       gr2  20170404           5.5          1.30          9.10    12.tif
12       gr2  20170404           9.9          0.90          7.06    13.tif
13      ctrl  20170404           3.2          0.78          8.10    14.tif
14      ctrl  20170404           1.1          1.20          7.90    15.tif
15      kjkj  20161212           3.3          4.00          5.12    16.tif
16      asht  20161212           1.2          5.00          5.23    17.tif
17      ctrl  20161212           5.4          6.00          5.15    18.tif

Now, for each experiment date I have different experimental conditions, but I always have a condition named ctrl. One of the normalizations I'd like to perform is to calculate, for each numerical column, the mean of the control experiment for that date and then divide all observables from that date by their corresponding mean.

I can quickly calculate some of the per-date, per-condition summary statistics using:

grsummary = df.groupby(["date", "condition"])[["observation1", "observation2", "observation3"]].agg(["min", "max", "mean", "std"])
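(As an illustration, not part of the original question:) the multi-indexed frame that agg returns can be sliced per condition with .xs, which is handy for pulling out the per-date ctrl statistics. A minimal sketch on a toy frame with the same index structure:

```python
import pandas as pd

# Toy frame with the same (date, condition) grouping structure
df = pd.DataFrame({
    'condition': ['ctrl', 'abc', 'ctrl', 'abc'],
    'date': ['20170131', '20170131', '20170404', '20170404'],
    'observation1': [1.0, 2.0, 3.0, 5.0],
})

grsummary = df.groupby(['date', 'condition']).agg(['min', 'max', 'mean', 'std'])

# .xs slices out the 'ctrl' rows across all dates in one call
ctrl_stats = grsummary.xs('ctrl', level='condition')
print(ctrl_stats[('observation1', 'mean')])
```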

Then I would like to apply these summary statistics in a normalization for each experiment date:

grdate = df.groupby("date")

and apply the normalization in a fashion like this:

def normalize_by_ctrlmean(grp_frame, summarystats):
    #  the following is only pseudo-code as I don't know how to do this
    grp_frame / summarystats(nanmean)

grdate.apply(normalize_by_ctrlmean, summarystats=grsummary)

The last step is only pseudo-code. This is what I'm struggling with. I could do the normalization using nested for-loops over the dates, conditions, and column names of the numerical columns, but I'm new to the split-apply-combine paradigm and I think there must be a simpler solution. Any help greatly appreciated.

volkerH asked Jan 04 '23 18:01
2 Answers

Here's how you can do this using groupby and apply:


Split

Since you want to perform operations 'per date', you only need to split by date:

grdate = df.groupby("date")


Apply & Combine

Next, define a transformation function that can be applied to each group, taking the group itself as an argument.

In your case, the function should compute the mean of the group's ctrl values and then divide all of the group's observations by this mean:

def norm_apply(group):

    # Select the 'ctrl' condition
    ctrl_selected = group[group['condition']=='ctrl']

    # Extract its numerical values
    ctrl_numeric = ctrl_selected.select_dtypes(include=[np.number])

    # Compute the means (column-wise)
    ctrl_means = np.nanmean(ctrl_numeric,axis=0) 

    # Extract numerical values for all conditions
    group_numeric = group.select_dtypes(include=[np.number])

    # Divide by the ctrl means
    divided = group_numeric / ctrl_means

    # Return result
    return divided

(You can do this as a silly one-liner, if you wish...)

norm_apply = lambda x : x.select_dtypes(include=[np.number]) / np.nanmean(x[x['condition']=='ctrl'].select_dtypes(include=[np.number]),axis=0)


Now you can simply apply this function to your grouped dataframe:

normed = grdate.apply(norm_apply)

This should give you the values you need, combined into the same shape/order as your original df:

normed.head()

>>   observation1  observation2  observation3
0          0.96      1.192308       1.25000
1          1.76      0.846154       0.75000
2          1.04      0.807692       0.75000
3          0.88      0.461538       1.25625
4          1.84      0.923077       1.59375
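As an aside to this answer, the same divide-by-ctrl-mean can also be written without a custom function by masking the non-ctrl rows and using groupby().transform, which broadcasts each date's ctrl mean back to every row in the original order. A sketch on a toy frame (column names here are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'condition': ['ctrl', 'abc', 'ctrl', 'def'],
    'date': ['20170131', '20170131', '20170404', '20170404'],
    'observation1': [1.0, 2.0, 4.0, 6.0],
})

num_cols = ['observation1']

# Copy the numeric columns, then blank out every non-ctrl row
ctrl_only = df[num_cols].copy()
ctrl_only[df['condition'] != 'ctrl'] = np.nan

# transform('mean') skips NaNs, so every row receives its date's ctrl mean
ctrl_means = ctrl_only.groupby(df['date']).transform('mean')

# Element-wise divide; result has the same shape and order as df[num_cols]
normed = df[num_cols] / ctrl_means
print(normed)  # observation1: 1.0, 2.0, 1.0, 1.5
```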


Merge into the Original DataFrame

One way of adding these results back to the original df would be like this:

# Add prefix to column names
normed = normed.add_prefix('normed_')

# Concatenate with initial data frame
final = pd.concat([df,normed],axis=1)
display(final.head())


And finally, you can group by date and condition and look at the means:

final.groupby(['date','condition']).mean(numeric_only=True)

If everything worked correctly, the means for the ctrl condition should all be 1.0.
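The question also mentioned other normalizations such as a Z-score; the same apply pattern covers that by subtracting the ctrl mean and dividing by the ctrl standard deviation. A sketch on a toy frame (the choice of population std, ddof=0, is an assumption, not from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'condition': ['ctrl', 'ctrl', 'abc'],
    'date': ['20170131', '20170131', '20170131'],
    'observation1': [1.2, 1.3, 2.2],
})

def zscore_by_ctrl(group):
    # Mean and std of the ctrl subgroup, column-wise
    ctrl = group.loc[group['condition'] == 'ctrl'].select_dtypes('number')
    nums = group.select_dtypes('number')
    # Z-score every observation against the ctrl subgroup's statistics
    return (nums - ctrl.mean()) / ctrl.std(ddof=0)

zscored = df.groupby('date', group_keys=False).apply(zscore_by_ctrl)
print(zscored)
```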



(Side Note: Although Ian Thompson's answer also works, I believe this approach sticks more closely to the split-apply-combine ideology.)

WhoIsJack answered Jan 16 '23 21:01

I'm not quite sure what you want the function to do. I don't have enough reputation to comment, so I will give my best guess at answering your question.

Seeing that your function is called normalize_by_ctrlmean, I assume that you always want to divide by the mean of the ctrl group for each date, within each observation. To do this we'll first have to tidy up your data a bit using the melt function.

df1 = df.melt(id_vars = ['condition',
                         'date',
                         'rawsource'],
              value_vars = ['observation1',
                            'observation2',
                            'observation3'],
              var_name = 'observations')

df1.head()


Next we will calculate the mean for the ctrl group

ctrl_mean = (df1[df1.condition == 'ctrl']
             .groupby(['date', 'observations'])['value']
             .mean()
             .reset_index()
             .rename(columns={'value': 'ctrl_mean'}))

ctrl_mean

Merge this dataframe with the melted dataframe.

df2 = df1.merge(ctrl_mean,
                how = 'inner',
                on = ['date',
                      'observations'])

df2.head()


And lastly, divide the value column by the ctrl_mean column and insert into the dataframe.

df2.insert(df2.shape[1],
           'normalize_by_ctrlmean',
           df2.loc[:, 'value'] / df2.loc[:, 'ctrl_mean'])

df2.head()
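As a quick sanity check on this tidy-format approach (not from the original answer): within each date and observation, the normalized values of the ctrl rows should average to 1. A sketch on a toy frame shaped like df2:

```python
import pandas as pd

# Toy tidy frame in the same shape as df2 after the merge step
df2 = pd.DataFrame({
    'condition': ['ctrl', 'ctrl', 'abc'],
    'date': ['20170131'] * 3,
    'observations': ['observation1'] * 3,
    'value': [1.2, 1.3, 2.2],
    'ctrl_mean': [1.25, 1.25, 1.25],
})
df2['normalize_by_ctrlmean'] = df2['value'] / df2['ctrl_mean']

# Average the normalized ctrl rows per date and observation
check = (df2[df2.condition == 'ctrl']
         .groupby(['date', 'observations'])['normalize_by_ctrlmean']
         .mean())
print(check)  # each entry should be (numerically) 1.0
```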


Hope this gets you closer to what you were needing.

EDIT

Per your comment I'll show how to get back to a similar dataframe you had with observation columns first with the pivot_table function, and then with the groupby function.

pivot_table

df2.pivot_table(index = ['date', # columns to use as the index
                   'condition',
                   'rawsource'],
          columns = 'observations', # this will make columns out of the values in this column
          values = ['value', # these will be the values in each column
                    'ctrl_mean', # swaplevel swaps the column levels (axis = 1), sort_index sorts and "smooshes" them together
                    'normalize_by_ctrlmean']).swaplevel(axis = 1).sort_index(axis = 1).reset_index() # reset_index so you can refer to specific columns


groupby

df2.groupby(['date', # groupby these columns to make the index
             'condition',
             'rawsource',
             'observations']).agg({'value' : 'max', # take the max of these as the aggregate (there was only one value for each so the max just returns that value)
                                   'ctrl_mean' : 'max', # unstack('observations') makes columns out of the 'observations'
                                   'normalize_by_ctrlmean' : 'max'}).unstack('observations').swaplevel(axis = 1).sort_index(axis = 1).reset_index() # these do the same thing as on the pivot_table example


In addition, you can remove the swaplevel and sort_index calls to keep the aggregate columns on the top level instead of the observations.


Ian Thompson answered Jan 16 '23 20:01