What is the most idiomatic way to normalize each row of a pandas DataFrame? Normalizing the columns is easy, so one (very ugly!) option is:
(df.T / df.T.sum()).T
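Pandas broadcasting rules prevent df / df.sum(axis=1) from doing this, because the Series of row sums is aligned against the DataFrame's column labels rather than its row labels.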
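To see the alignment problem concretely, here is a minimal sketch on a hypothetical toy frame (the labels r0, r1, a, b and the values are made up for illustration). The row labels are deliberately different from the column labels so the mismatch is visible.

```python
import pandas as pd

# Hypothetical toy data: row labels differ from column labels on purpose.
df = pd.DataFrame({"a": [1.0, 3.0], "b": [3.0, 1.0]}, index=["r0", "r1"])

row_sums = df.sum(axis=1)   # Series indexed by the row labels r0, r1
naive = df / row_sums       # the Series index is matched against df's columns

# No column is called r0 or r1, so no label matches and every cell is NaN.
print(naive.isna().all().all())  # True
```

With a default integer index and integer column labels, some labels would match by coincidence, so the division would run but divide by the wrong sums.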
Using min-max feature scaling: the min-max approach (often called normalization) rescales each feature to the fixed range [0, 1] by subtracting the feature's minimum value and then dividing by its range. We can apply min-max scaling in pandas using the .min() and .max() methods.
To standardize all columns of a pandas DataFrame, subtract each column's mean and divide by its standard deviation. By default, pandas computes the standard deviation with ddof=1, that is, from the unbiased sample variance. Alternatively, you can get the same result using DataFrame.apply() and a lambda.
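A minimal sketch of column-wise standardization, written both directly and with apply() and a lambda. The frame and its column names x and y are hypothetical sample data.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 60.0]})

# Column-wise z-score; DataFrame.std() defaults to ddof=1 (sample std).
z = (df - df.mean()) / df.std()

# The same computation via apply(), one column at a time.
z_apply = df.apply(lambda col: (col - col.mean()) / col.std())

print(z)
```

After this transformation each column has mean 0 and sample standard deviation 1. A constant column has zero standard deviation and would produce NaN.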
You can normalize data to the range 0 to 1 by using the formula (data - np.min(data)) / (np.max(data) - np.min(data)).
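The min-max formula can be sketched in pandas as follows, first per column and then per row, since the question is about rows. The sample frame is hypothetical, and the sketch assumes no column or row is constant (a zero range would give NaN).

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 60.0]})

# Column-wise min-max scaling: each column ends up in [0, 1].
col_scaled = (df - df.min()) / (df.max() - df.min())

# Row-wise variant: align the per-row statistics on the index with axis=0.
row_min = df.min(axis=1)
row_range = df.max(axis=1) - row_min
row_scaled = df.sub(row_min, axis=0).div(row_range, axis=0)

print(col_scaled)
print(row_scaled)
```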
To overcome the broadcasting issue, you can use the div method:
df.div(df.sum(axis=1), axis=0)
See pandas User Guide: Matching / broadcasting behavior
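As a small worked example, assuming a hypothetical 2x2 frame, div with axis=0 matches the row sums on the row labels, and the result agrees with the transpose-based version from the question:

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"a": [1.0, 3.0], "b": [3.0, 1.0]}, index=["r0", "r1"])

# axis=0 tells div to align the Series of row sums with the row index.
normalized = df.div(df.sum(axis=1), axis=0)
print(normalized)
#        a     b
# r0  0.25  0.75
# r1  0.75  0.25

# Same values as the double-transpose approach.
transposed = (df.T / df.T.sum()).T
print(normalized.equals(transposed))
```

Each row of normalized sums to 1.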
We could also get the underlying numpy array, sum on axis while keeping the dimensions and element-wise divide:
df / df.to_numpy().sum(axis=1, keepdims=True)
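The reason this works is the shape of the sums. A short sketch on hypothetical toy data: keepdims=True leaves the row sums as an (n, 1) column array, which broadcasts across the columns by position, so no label alignment is involved.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"a": [1.0, 3.0], "b": [3.0, 1.0]}, index=["r0", "r1"])

# keepdims=True keeps the summed axis as a length-1 dimension.
row_sums = df.to_numpy().sum(axis=1, keepdims=True)
print(row_sums.shape)  # (2, 1)

# The (2, 1) array is broadcast across the columns; labels are preserved.
result = df / row_sums
print(result)

# Agrees with the label-aligned div approach.
print(np.allclose(result, df.div(df.sum(axis=1), axis=0)))
```

Without keepdims=True the sums would be a flat array of length n, which would be matched against the columns by position instead.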
This method is ~60% faster than sum on axis + div by the index:
df = pd.DataFrame(np.random.rand(1000000, 100))

%timeit -n 10 df.div(df.sum(axis=1), axis=0)
748 ms ± 14.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 df / df.to_numpy().sum(axis=1, keepdims=True)
452 ms ± 4.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In fact, this trend holds if we increase the number of rows and the number of columns:
Code to reproduce the above plots:
import perfplot
import pandas as pd
import numpy as np

def enke(df):
    # NumPy row sums kept as an (n, 1) array, then positional broadcasting
    return df / df.to_numpy().sum(axis=1, keepdims=True)

def joris(df):
    # pandas row sums, aligned on the index by div
    return df.div(df.sum(axis=1), axis=0)

# Vary the number of rows, with 10 columns
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.rand(n, 10)),
    kernels=[enke, joris],
    labels=['enke', 'joris'],
    n_range=[2 ** k for k in range(4, 21)],
    equality_check=np.allclose,
    xlabel='~len(df)',
    title='For len(df)x10 DataFrames'
)

# Vary the number of columns, with 10_000 rows
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.rand(10000, n)),
    kernels=[enke, joris],
    labels=['enke', 'joris'],
    # np.random.rand needs integer dimensions, so truncate and drop duplicates
    n_range=sorted({int(1.4 ** k) for k in range(21)}),
    equality_check=np.allclose,
    xlabel='~width(df)',
    title='For 10_000xwidth(df) DataFrames'
)