normalize column in pandas dataframe based on value in another column

I want to normalize the values in one column of a pandas dataframe based on the value in another column. It's not a pure normalization in a statistical sense. The second value is a type; I want to sum all the first values for each type, then, in each row, divide the value by the total for the type of that row. An example should make this clearer.

df = pd.read_table(datafile, names = ["A", "B", "value", "type"])

    A   B  value   type
0  A1  B1      1  type1
1  A2  B2      1  type1
2  A1  B1      1  type2
3  A1  B3      1  type3
4  A2  B2      1  type2
5  A2  B4      1  type3
6  A3  B4      1  type2
7  A3  B5      1  type3
8  A4  B6      1  type2
9  A4  B7      1  type3
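(For anyone reproducing this without the data file, the same frame can be built directly; this is just a self-contained sketch with the values copied from the printout above:)

```python
import pandas as pd

# Same frame as produced by read_table above, built inline for
# reproducibility; values are copied from the printed output.
df = pd.DataFrame({
    "A": ["A1", "A2", "A1", "A1", "A2", "A2", "A3", "A3", "A4", "A4"],
    "B": ["B1", "B2", "B1", "B3", "B2", "B4", "B4", "B5", "B6", "B7"],
    "value": [1] * 10,
    "type": ["type1", "type1", "type2", "type3", "type2",
             "type3", "type2", "type3", "type2", "type3"],
})
```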

Then I can find the sums with something like:

types = df.groupby(["type"])["value"].sum()

type
type1    2
type2    4
type3    4
Name: value, dtype: int64

Then how do I use this to normalize the value in each row?

I can use a loop like this to calculate the normalized values:

norms = []
for ix, row in df.iterrows():
    norms.append(row["value"]/types[row["type"]])

And then replace the column with a new one with these values:

df["value"] = pd.Series(norms)

    A   B  value   type
0  A1  B1   0.50  type1
1  A2  B2   0.50  type1
2  A1  B1   0.25  type2
3  A1  B3   0.25  type3
4  A2  B2   0.25  type2
5  A2  B4   0.25  type3
6  A3  B4   0.25  type2
7  A3  B5   0.25  type3
8  A4  B6   0.25  type2
9  A4  B7   0.25  type3

But as I understand it, looping like this is neither efficient nor idiomatic, and there is likely a way to do it with standard pandas functions.

Thanks.

asked Sep 13 '25 by mepstein

2 Answers

You could use transform, which performs an operation on each group and then expands the result back up to match the original index. For example:

>>> df["value"] /= df.groupby("type")["value"].transform(sum)
>>> df
    A   B  value   type
0  A1  B1   0.50  type1
1  A2  B2   0.50  type1
2  A1  B1   0.25  type2
3  A1  B3   0.25  type3
4  A2  B2   0.25  type2
5  A2  B4   0.25  type3
6  A3  B4   0.25  type2
7  A3  B5   0.25  type3
8  A4  B6   0.25  type2
9  A4  B7   0.25  type3

because we have

>>> df.groupby("type")["value"].transform(sum)
0    2
1    2
2    4
3    4
4    4
5    4
6    4
7    4
8    4
9    4
dtype: int64
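If you've already computed the per-type sums as in the question, an equivalent approach (a sketch, not necessarily faster than transform) is to map them back onto each row with Series.map:

```python
import pandas as pd

df = pd.DataFrame({
    "value": [1] * 10,
    "type": ["type1", "type1", "type2", "type3", "type2",
             "type3", "type2", "type3", "type2", "type3"],
})

# Per-type totals, indexed by type -- same as `types` in the question.
sums = df.groupby("type")["value"].sum()

# map() looks up each row's type in `sums`, giving a Series aligned
# with df, so the division is fully vectorized.
df["value"] = df["value"] / df["type"].map(sums)
```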
answered Sep 16 '25 by DSM

I think the best way to achieve this is by using the .apply() method on your groupby object:

# Using backslashes for explicit line continuation, not seen
#   that often in Python but useful in pandas when you're
#   chaining a lot of methods one after the other
df['value_normed'] = df.groupby('type', group_keys=False)\
    .apply(lambda g: g['value'] / g['value'].sum())
df
Out[9]: 
    A   B  value   type  value_normed
0  A1  B1      1  type1          0.50
1  A2  B2      1  type1          0.50
2  A1  B1      1  type2          0.25
3  A1  B3      1  type3          0.25
4  A2  B2      1  type2          0.25
5  A2  B4      1  type3          0.25
6  A3  B4      1  type2          0.25
7  A3  B5      1  type3          0.25
8  A4  B6      1  type2          0.25
9  A4  B7      1  type3          0.25

You need the group_keys=False argument so that type doesn't become part of the result's index; otherwise you can't easily align the transformed values back to your original dataframe.
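To see the difference, compare the result indexes with and without group_keys (a small sketch with made-up data):

```python
import pandas as pd

df = pd.DataFrame({
    "value": [1, 1, 2],
    "type": ["a", "a", "b"],
})

# group_keys=False: the result keeps the original row index,
# so it can be assigned straight back to df.
kept = df.groupby("type", group_keys=False).apply(
    lambda g: g["value"] / g["value"].sum())

# group_keys=True: the group label is prepended, producing a
# (type, original index) MultiIndex that no longer aligns with df.
nested = df.groupby("type", group_keys=True).apply(
    lambda g: g["value"] / g["value"].sum())
```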

answered Sep 16 '25 by Marius