
Normalizing a pandas DataFrame by row

What is the most idiomatic way to normalize each row of a pandas DataFrame? Normalizing the columns is easy, so one (very ugly!) option is:

(df.T / df.T.sum()).T 

Pandas broadcasting rules prevent df / df.sum(axis=1) from doing this directly.

ChrisB asked Sep 03 '13 14:09



2 Answers

To overcome the broadcasting issue, you can use the div method:

df.div(df.sum(axis=1), axis=0) 

See pandas User Guide: Matching / broadcasting behavior
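As a minimal sketch of why the plain division fails and what axis=0 changes, the example below uses a small made-up DataFrame (the labels "a", "b", "x", "y" are illustrative assumptions, not from the question). Plain division aligns the Series of row sums with the DataFrame's columns, while div with axis=0 aligns it with the row index:

```python
import pandas as pd

# Hypothetical toy data; labels and values are for illustration only.
df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 6.0]}, index=["x", "y"])
row_sums = df.sum(axis=1)  # Series indexed by the row labels "x" and "y"

# Plain division matches the Series index against the column labels.
# "x"/"y" do not match "a"/"b", so every entry of the result is NaN.
misaligned = df / row_sums

# div with axis=0 matches the Series index against the row index instead,
# so each row is divided by its own sum.
normalized = df.div(row_sums, axis=0)
print(normalized)
```

Each row of normalized should then sum to 1, which is a quick way to check the result on your own data.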

joris answered Sep 19 '22 22:09


We could also get the underlying numpy array, sum on axis while keeping the dimensions and element-wise divide:

df / df.to_numpy().sum(axis=1, keepdims=True) 
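To make the role of keepdims concrete, here is a small sketch on made-up data (the column names and values are assumptions for illustration). With keepdims=True the row sums keep shape (n, 1), so they broadcast across the columns; the result should match the div-based version:

```python
import numpy as np
import pandas as pd

# Hypothetical 2x3 frame with values 1..6, for illustration only.
df = pd.DataFrame(np.arange(1.0, 7.0).reshape(2, 3), columns=list("abc"))

# keepdims=True preserves the summed axis: shape is (2, 1), not (2,).
row_sums = df.to_numpy().sum(axis=1, keepdims=True)

# A (2, 1) array broadcasts across the columns, dividing each row by its sum.
via_numpy = df / row_sums

# The label-aligned equivalent, for comparison.
via_div = df.div(df.sum(axis=1), axis=0)
print(via_numpy)
```

Note that this route bypasses index alignment and assumes every column is numeric; with mixed dtypes, to_numpy() returns an object array and the sum may not behave as expected.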

In the timing below, this method is ~60% faster than summing along axis=1 and dividing with div(..., axis=0):

df = pd.DataFrame(np.random.rand(1000000, 100))

%timeit -n 10 df.div(df.sum(axis=1), axis=0)
748 ms ± 14.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 df / df.to_numpy().sum(axis=1, keepdims=True)
452 ms ± 4.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In fact, this trend holds if we increase the number of rows and the number of columns:

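[Figure: two perfplot runtime plots comparing enke and joris, titled 'For len(df)x10 DataFrames' (x-axis '~len(df)') and 'For 10_000xwidth(df) DataFrames' (x-axis '~width(df)')]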


Code to reproduce the above plots:

import perfplot
import pandas as pd
import numpy as np

def enke(df):
    # Divide by row sums computed on the underlying NumPy array
    return df / df.to_numpy().sum(axis=1, keepdims=True)

def joris(df):
    # Divide by row sums, aligned on the row index
    return df.div(df.sum(axis=1), axis=0)

# Vary the number of rows, with 10 columns
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.rand(n, 10)),
    kernels=[enke, joris],
    labels=['enke', 'joris'],
    n_range=[2 ** k for k in range(4, 21)],
    equality_check=np.allclose,
    xlabel='~len(df)',
    title='For len(df)x10 DataFrames'
)

# Vary the number of columns, with 10,000 rows
perfplot.show(
    # n_range below holds floats; int(n) gives rand an integer dimension
    setup=lambda n: pd.DataFrame(np.random.rand(10000, int(n))),
    kernels=[enke, joris],
    labels=['enke', 'joris'],
    n_range=[1.4 ** k for k in range(21)],
    equality_check=np.allclose,
    xlabel='~width(df)',
    title='For 10_000xwidth(df) DataFrames'
)

enke answered Sep 19 '22 22:09