Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to normalize by another row in a pandas DataFrame?

Tags:

python

pandas

I'm having trouble finding the non-for loop version of this. Say this is my input:

In [94]: df
Out[94]: 
        N  experiment   color  value    value2
0  145000           0     red   0.30  0.363953
1   14000           1     red   0.31  0.218978
2   13000           2     red   0.29  0.948070
3   15000           0  yellow   0.31  0.620201
4    1200           1  yellow   0.32  0.567513
5    1400           2  yellow   0.31  0.318197
6   40000           0   green   0.29  0.947226
7    3000           1   green   0.31  0.084243
8    7000           2   green   0.32  0.961020

[9 rows x 5 columns]

Experiment 0 is my control. I performed this experiment for various colors. I want to normalize all rows by the matching color experiment 0.

In [104]: df
Out[104]: 
        N  experiment   color  value    value2  scaled_value  scaled_value2
0  145000           0     red   0.30  0.363953      1.000000       1.000000
1   14000           1     red   0.31  0.218978      1.033333       0.590786
2   13000           2     red   0.29  0.948070      0.966667       2.604732
3   15000           0  yellow   0.31  0.620201      1.000000       1.000000
4    1200           1  yellow   0.32  0.567513      1.032258       0.914220
5    1400           2  yellow   0.31  0.318197      1.000000       0.512737
6   40000           0   green   0.29  0.947226      1.000000       1.000000
7    3000           1   green   0.31  0.084243      1.068966       0.088680
8    7000           2   green   0.32  0.961020      1.103448       1.014541

[9 rows x 7 columns]

It seems I would want to exact the experiment zero parameters to their own column to permit easy division, but I can't get the stacking/pivoting/concat'ing to work properly. I did this with for loops using .at and assigning values but it felt so wrong.

The next steps would be normalizing several columns and creating upplower and lower error bounds calculated from (N, value), (N, value2), (N, other_values) etc.

like image 446
physicsmichael Avatar asked Aug 21 '15 15:08

physicsmichael


1 Answers

One way would be to use transform (here using idxmin, although there are many alternatives) to get the indices of the rows we want to use as the denominator:

>>> ii = df.groupby("color")["experiment"].transform("idxmin")
>>> cols = ["value", "value2"]
>>> new_cols = (df.loc[:,cols] /df.loc[ii, cols].values)
>>> df.join(new_cols.rename(columns=lambda x: "scaled_" + x))
        N  experiment   color  value    value2  scaled_value  scaled_value2
0  145000           0     red   0.30  0.363953      1.000000       1.000000
1   14000           1     red   0.31  0.218978      1.033333       0.601666
2   13000           2     red   0.29  0.948070      0.966667       2.604924
3   15000           0  yellow   0.31  0.620201      1.000000       1.000000
4    1200           1  yellow   0.32  0.567513      1.032258       0.915047
5    1400           2  yellow   0.31  0.318197      1.000000       0.513055
6   40000           0   green   0.29  0.947226      1.000000       1.000000
7    3000           1   green   0.31  0.084243      1.068966       0.088937
8    7000           2   green   0.32  0.961020      1.103448       1.014563

Step by step, first we find the denominator indices:

>>> ii = df.groupby("color")["experiment"].transform("idxmin")
>>> ii
0    0
1    0
2    0
3    3
4    3
5    3
6    6
7    6
8    6
dtype: int64

Then we can use this to index into the frame:

>>> df.loc[ii, cols]
   value    value2
0   0.30  0.363953
0   0.30  0.363953
0   0.30  0.363953
3   0.31  0.620201
3   0.31  0.620201
3   0.31  0.620201
6   0.29  0.947226
6   0.29  0.947226
6   0.29  0.947226

Since we're handling the alignment ourselves, we need to call .values to drop down to the underlying array-- otherwise pandas will try to outsmart us and align things correctly based on the indices.

Then we divide:

>>> (df.loc[:,cols] /df.loc[ii, cols].values)
      value    value2
0  1.000000  1.000000
1  1.033333  0.601666
2  0.966667  2.604924
3  1.000000  1.000000
4  1.032258  0.915047
5  1.000000  0.513055
6  1.000000  1.000000
7  1.068966  0.088937
8  1.103448  1.014563

and finally join them, renaming the new columns:

>>> df.join(new_cols.rename(columns=lambda x: "scaled_" + x))
        N  experiment   color  value    value2  scaled_value  scaled_value2
0  145000           0     red   0.30  0.363953      1.000000       1.000000
1   14000           1     red   0.31  0.218978      1.033333       0.601666
2   13000           2     red   0.29  0.948070      0.966667       2.604924
3   15000           0  yellow   0.31  0.620201      1.000000       1.000000
4    1200           1  yellow   0.32  0.567513      1.032258       0.915047
5    1400           2  yellow   0.31  0.318197      1.000000       0.513055
6   40000           0   green   0.29  0.947226      1.000000       1.000000
7    3000           1   green   0.31  0.084243      1.068966       0.088937
8    7000           2   green   0.32  0.961020      1.103448       1.014563
like image 69
DSM Avatar answered Sep 23 '22 07:09

DSM