I am trying to normalize experimental data in a pandas DataFrame that contains several columns with numerical observables (features), columns with the date and experimental conditions, and additional non-numerical columns such as filenames.
A simplified data table with this structure can be generated with this code snippet:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'condition': ['ctrl', 'abc', 'ctrl', 'abc', 'def', 'ctlr', 'ctlr', 'asdasd', 'afff', 'afff', 'gr1','gr2', 'gr2', 'ctrl', 'ctrl', 'kjkj','asht','ctrl'],
'date': ['20170131', '20170131', '20170131', '20170131','20170131', '20170606', '20170606', '20170606', '20170606', '20170606', '20170404', '20170404', '20170404', '20170404', '20170404', '20161212', '20161212', '20161212'],
'observation1': [1.2, 2.2, 1.3, 1.1, 2.3 , 2.3, 4.2, 3.3, 5.1, 3.3, 3.4, 5.5, 9.9, 3.2, 1.1, 3.3, 1.2, 5.4],
'observation2': [3.1, 2.2, 2.1, 1.2, 2.4, 1.2, 1.5, 1.33, 1.5, 1.6, 1.4, 1.3, 0.9, 0.78, 1.2, 4.0, 5.0, 6.0],
'observation3': [2.0, 1.2, 1.2, 2.01, 2.55, 2.05, 1.66, 3.2, 3.21, 3.04, 8.01, 9.1, 7.06, 8.1, 7.9, 5.12, 5.23, 5.15],
'rawsource': ["1.tif", "2.tif", "3.tif", "4.tif", "5.tif","6.tif", "7.tif", "8.tif", "9.tif", "10.tif", "11.tif", "12.tif", "13.tif", "14.tif", "15.tif", "16.tif", "17.tif", "18.tif"]
})
print(df)
and looks like this:
condition date observation1 observation2 observation3 rawsource
0 ctrl 20170131 1.2 3.10 2.00 1.tif
1 abc 20170131 2.2 2.20 1.20 2.tif
2 ctrl 20170131 1.3 2.10 1.20 3.tif
3 abc 20170131 1.1 1.20 2.01 4.tif
4 def 20170131 2.3 2.40 2.55 5.tif
5 ctlr 20170606 2.3 1.20 2.05 6.tif
6 ctlr 20170606 4.2 1.50 1.66 7.tif
7 asdasd 20170606 3.3 1.33 3.20 8.tif
8 afff 20170606 5.1 1.50 3.21 9.tif
9 afff 20170606 3.3 1.60 3.04 10.tif
10 gr1 20170404 3.4 1.40 8.01 11.tif
11 gr2 20170404 5.5 1.30 9.10 12.tif
12 gr2 20170404 9.9 0.90 7.06 13.tif
13 ctrl 20170404 3.2 0.78 8.10 14.tif
14 ctrl 20170404 1.1 1.20 7.90 15.tif
15 kjkj 20161212 3.3 4.00 5.12 16.tif
16 asht 20161212 1.2 5.00 5.23 17.tif
17 ctrl 20161212 5.4 6.00 5.15 18.tif
Now, for each experiment date I have different experimental conditions, but there is always a condition named ctrl. One of the normalizations I'd like to perform is, for each numerical column, to calculate the mean of the ctrl experiment for that date and then divide all observables from that date by their corresponding mean.
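To make the target computation concrete, here is the arithmetic for a single date (a toy slice of the 20170131 rows from the table above, showing only observation1):

```python
import pandas as pd

# toy slice: the 20170131 rows, observation1 only
df = pd.DataFrame({
    'condition': ['ctrl', 'abc', 'ctrl', 'abc', 'def'],
    'date': ['20170131'] * 5,
    'observation1': [1.2, 2.2, 1.3, 1.1, 2.3],
})

# mean of the ctrl rows for this date: (1.2 + 1.3) / 2 = 1.25
ctrl_mean = df.loc[df['condition'] == 'ctrl', 'observation1'].mean()

# divide every observation from this date by that mean, e.g. 1.2 / 1.25 = 0.96
normalized = df['observation1'] / ctrl_mean
```

The same division would then be repeated for every numerical column and every date.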
I can quickly calculate some of the per-date, per-condition summary statistics using:
grsummary = df.groupby(["date", "condition"]).agg(['min', 'max', 'mean', 'std'])
(pandas' built-in mean and std skip NaNs by default, so they stand in for np.nanmean and np.nanstd.)
Then I would like to apply these summary statistics in a normalization for each experiment date:
grdate = df.groupby("date")
and apply the normalization in a fashion like this:
def normalize_by_ctrlmean(grp_frame, summarystats):
    # the following is only pseudo-code as I don't know how to do this
    grp_frame / summarystats(nanmean)

grdate.apply(normalize_by_ctrlmean, summarystats=grsummary)
The last step is only pseudo-code, and this is what I'm struggling with. I could do the normalization using nested for-loops over dates, conditions, and the names of the numerical columns, but I'm new to the split-apply-combine paradigm and I think there must be a simpler solution. Any help greatly appreciated.
Here's how you can do this using df.apply:
Since you want to perform operations 'per date', you only need to split by date:
grdate = df.groupby("date")
Next, define a transformation function that can be applied to each group, taking the group itself as an argument.
In your case, the function should compute the mean of the group's ctrl values and then divide all of the group's observations by this mean:
def norm_apply(group):
    # Select the 'ctrl' condition
    ctrl_selected = group[group['condition'] == 'ctrl']
    # Extract its numerical columns
    ctrl_numeric = ctrl_selected.select_dtypes(include=[np.number])
    # Compute the means (column-wise)
    ctrl_means = np.nanmean(ctrl_numeric, axis=0)
    # Extract numerical columns for all conditions
    group_numeric = group.select_dtypes(include=[np.number])
    # Divide by the ctrl means
    divided = group_numeric / ctrl_means
    # Return the result
    return divided
(You can also write this as a one-liner, if you wish:)
norm_apply = lambda x: x.select_dtypes(include=[np.number]) / np.nanmean(x[x['condition'] == 'ctrl'].select_dtypes(include=[np.number]), axis=0)
Now you can simply apply this function to your grouped dataframe:
normed = grdate.apply(norm_apply)
This should give you the values you need, combined into the same shape/order as your original df:
normed.head()
>> observation1 observation2 observation3
0 0.96 1.192308 1.25000
1 1.76 0.846154 0.75000
2 1.04 0.807692 0.75000
3 0.88 0.461538 1.25625
4 1.84 0.923077 1.59375
One way of adding these results back to the original df would be like this:
# Add prefix to column names
normed = normed.add_prefix('normed_')
# Concatenate with initial data frame
final = pd.concat([df,normed],axis=1)
display(final.head())
And finally, you can group by date and condition and look at the means:
final.groupby(['date','condition']).mean()
If everything worked correctly, the means for the ctrl condition should all be 1.0. (Note that your example data has no ctrl rows for date 20170606, only a misspelled ctlr, so that date's normalized values come out as NaN.)
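A quick way to sanity-check the whole pipeline is to run it on a toy frame with the same structure (a minimal sketch; group_keys=False keeps the original index so the concat step aligns):

```python
import numpy as np
import pandas as pd

# toy frame: every date has at least one ctrl row
df = pd.DataFrame({
    'condition': ['ctrl', 'abc', 'ctrl', 'abc', 'ctrl', 'def'],
    'date': ['20170131'] * 4 + ['20170606'] * 2,
    'observation1': [1.2, 2.2, 1.3, 1.1, 2.0, 4.0],
})

def norm_apply(group):
    # column-wise ctrl means for this date
    ctrl_means = group.loc[group['condition'] == 'ctrl'].select_dtypes(include=[np.number]).mean()
    # divide the whole group's numeric columns by those means
    return group.select_dtypes(include=[np.number]) / ctrl_means

normed = df.groupby('date', group_keys=False).apply(norm_apply).add_prefix('normed_')
final = pd.concat([df, normed], axis=1)

# per-date means of the normalized ctrl rows; all should be 1.0
check = final[final['condition'] == 'ctrl'].groupby('date')['normed_observation1'].mean()
```

With the real data, the one date that lacks a ctrl row would yield NaN instead of 1.0.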
(Side Note: Although Ian Thompson's answer also works, I believe this approach sticks more closely to the split-apply-combine ideology.)
I'm not quite sure what you want the function to do. I don't have enough reputation to comment, so I'll give my best guess at answering your question.
Seeing that your function is called normalize_by_ctrlmean, I assume that you always want to divide by the mean of the ctrl group for each date, within each observation. To do this we'll have to tidy up your data a bit using the melt function.
df1 = df.melt(id_vars = ['condition',
'date',
'rawsource'],
value_vars = ['observation1',
'observation2',
'observation3'],
var_name = 'observations')
df1.head()
Next we will calculate the mean for the ctrl group (selecting the value column first so the aggregation only touches numeric data):
ctrl_mean = df1[df1.condition == 'ctrl'].groupby(['date',
                     'observations'])['value'].agg('mean').reset_index().rename(columns = {'value' : 'ctrl_mean'})
ctrl_mean
Merge this dataframe with the melted dataframe.
df2 = df1.merge(ctrl_mean,
how = 'inner',
on = ['date',
'observations'])
df2.head()
And lastly, divide the value column by the ctrl_mean column and insert the result into the dataframe.
df2.insert(df2.shape[1],
'normalize_by_ctrlmean',
df2.loc[:, 'value'] / df2.loc[:, 'ctrl_mean'])
df2.head()
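Put together, the melt → mean → merge → divide steps above can be run end to end on a small toy frame (a condensed sketch of the same pipeline):

```python
import pandas as pd

df = pd.DataFrame({
    'condition': ['ctrl', 'abc', 'ctrl'],
    'date': ['20170131'] * 3,
    'rawsource': ['1.tif', '2.tif', '3.tif'],
    'observation1': [1.2, 2.2, 1.3],
    'observation2': [3.1, 2.2, 2.1],
})

# melt to long form: one row per (condition, date, rawsource, observation)
df1 = df.melt(id_vars=['condition', 'date', 'rawsource'],
              value_vars=['observation1', 'observation2'],
              var_name='observations')

# per-date, per-observation means of the ctrl rows
ctrl_mean = (df1[df1.condition == 'ctrl']
             .groupby(['date', 'observations'])['value']
             .mean()
             .reset_index()
             .rename(columns={'value': 'ctrl_mean'}))

# merge back and divide
df2 = df1.merge(ctrl_mean, how='inner', on=['date', 'observations'])
df2['normalize_by_ctrlmean'] = df2['value'] / df2['ctrl_mean']
```

Note that the inner merge silently drops any date that has no ctrl rows.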
Hope this gets you closer to what you were needing.
EDIT
Per your comment, I'll show how to get back to a dataframe similar to the one you had, with observation columns: first with the pivot_table function, and then with the groupby function.
pivot_table
df2.pivot_table(index = ['date', # columns to use as the index
'condition',
'rawsource'],
columns = 'observations', # this will make columns out of the values in this column
values = ['value', # these will be the values in each column
'ctrl_mean', # swaplevel swaps the column levels (axis = 1), sort_index sorts and "smooshes" them together
'normalize_by_ctrlmean']).swaplevel(axis = 1).sort_index(axis = 1).reset_index() # reset_index so you can refer to specific columns
groupby
df2.groupby(['date', # groupby these columns to make the index
'condition',
'rawsource',
'observations']).agg({'value' : 'max', # take the max of these as the aggregate (there was only one value for each so the max just returns that value)
'ctrl_mean' : 'max', # unstack('observations') makes columns out of the 'observations'
'normalize_by_ctrlmean' : 'max'}).unstack('observations').swaplevel(axis = 1).sort_index(axis = 1).reset_index() # these do the same thing as on the pivot_table example
In addition, you can remove the swaplevel and sort_index calls to keep the aggregate columns on the top level instead of the observations.