I want to normalize the values in one column of a pandas dataframe based on the value in another column. It's not a pure normalization in a statistical sense. The second value is a type; I want to sum all the first values for each type, then, in each row, divide the value by the total for the type of that row. An example should make this clearer.
df = pd.read_table(datafile, names = ["A", "B", "value", "type"])
A B value type
0 A1 B1 1 type1
1 A2 B2 1 type1
2 A1 B1 1 type2
3 A1 B3 1 type3
4 A2 B2 1 type2
5 A2 B4 1 type3
6 A3 B4 1 type2
7 A3 B5 1 type3
8 A4 B6 1 type2
9 A4 B7 1 type3
Then I can find the sums with something like:
types = df.groupby(["type"])["value"].sum()
type
type1 2
type2 4
type3 4
Name: value, dtype: int64
Then how do I use this to normalize the value in each row?
I can use a loop like this to calculate the normalized values:
norms = []
for ix, row in df.iterrows():
norms.append(row["value"]/types[row["type"]])
And then replace the column with a new one with these values:
df["value"] = pd.Series(norms)
A B value type
0 A1 B1 0.50 type1
1 A2 B2 0.50 type1
2 A1 B1 0.25 type2
3 A1 B3 0.25 type3
4 A2 B2 0.25 type2
5 A2 B4 0.25 type3
6 A3 B4 0.25 type2
7 A3 B5 0.25 type3
8 A4 B6 0.25 type2
9 A4 B7 0.25 type3
But as I understand, using loops like this is not very efficient or proper, and there is likely a way to do it using some standard pandas functions.
Thanks.
You could use transform
, which performs an operation on each group and then expands the result back up to match the original index. For example"
>>> df["value"] /= df.groupby("type")["value"].transform(sum)
>>> df
A B value type
0 A1 B1 0.50 type1
1 A2 B2 0.50 type1
2 A1 B1 0.25 type2
3 A1 B3 0.25 type3
4 A2 B2 0.25 type2
5 A2 B4 0.25 type3
6 A3 B4 0.25 type2
7 A3 B5 0.25 type3
8 A4 B6 0.25 type2
9 A4 B7 0.25 type3
because we have
>>> df.groupby("type")["value"].transform(sum)
0 2
1 2
2 4
3 4
4 4
5 4
6 4
7 4
8 4
9 4
dtype: int64
I think the best way to achieve this is by using the .apply()
method on your groupby object:
# Using backslashes for explicit line continuation, not seen
# that often in Python but useful in pandas when you're
# chaining a lot of methods one after the other
df['value_normed'] = df.groupby('type', group_keys=False)\
.apply(lambda g: g['value'] / g['value'].sum())
df
Out[9]:
A B value type value_normed
0 A1 B1 1 type1 0.50
1 A2 B2 1 type1 0.50
2 A1 B1 1 type2 0.25
3 A1 B3 1 type3 0.25
4 A2 B2 1 type2 0.25
5 A2 B4 1 type3 0.25
6 A3 B4 1 type2 0.25
7 A3 B5 1 type3 0.25
8 A4 B6 1 type2 0.25
9 A4 B7 1 type3 0.25
You need the group_keys=False
argument so that type
doesn't become the index for each group's data, which prevents you from matching the transformed values back up to your original dataframe easily.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With