Normalizing a pandas DataFrame by row

Tags: python, pandas, dataframe, normalization

What is the most idiomatic way to normalize each row of a pandas DataFrame? Normalizing the columns is easy, so one (very ugly!) option is:

(df.T / df.T.sum()).T

Pandas broadcasting rules prevent df / df.sum(axis=1) from doing this, because the row sums are matched against the column labels rather than the row index.
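To make the problem concrete, here is a minimal sketch on a hypothetical 3x2 frame (the labels r0..r2 and a, b are made up for illustration). It shows what the plain division does and confirms that the transpose workaround gives row-normalized values.

```python
import pandas as pd

# Hypothetical 3x2 frame with string labels, chosen only for illustration.
df = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0], "b": [3.0, 2.0, 1.0]},
    index=["r0", "r1", "r2"],
)

row_sums = df.sum(axis=1)  # Series indexed by the row labels r0, r1, r2

# The / operator matches the Series index against the column labels.
# No label is shared between the two, so every cell comes out as NaN.
naive = df / row_sums
print(naive.isna().all().all())

# The transpose workaround from the question does normalize each row.
via_transpose = (df.T / df.T.sum()).T
print(via_transpose.sum(axis=1))
```

If the frame happened to use the same integer labels for rows and columns, the plain division would not produce NaN but would silently divide by the wrong totals, which is arguably worse.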

asked Sep 03 '13 14:09

ChrisB

2 Answers

To overcome the broadcasting issue, you can use the div method:

df.div(df.sum(axis=1), axis=0)

See pandas User Guide: Matching / broadcasting behavior
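As a quick check of the div approach, here is a small sketch on a made-up 3x2 frame (the data and column names are assumptions for illustration only):

```python
import pandas as pd

# Made-up data; any numeric frame with non-zero row sums would do.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [3.0, 2.0, 1.0]})

# axis=0 tells div to match the Series of row sums against the row index,
# so each row is divided by its own total.
normalized = df.div(df.sum(axis=1), axis=0)

print(normalized)
print(normalized.sum(axis=1))  # each row total should be 1.0
```

The result keeps the original index and column labels, since the division is aligned on the row index.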

answered Sep 19 '22 22:09

joris

We could also take the underlying NumPy array, sum along axis 1 with keepdims=True so the result stays two-dimensional, and divide element-wise:

df / df.to_numpy().sum(axis=1, keepdims=True)
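The role of keepdims is easiest to see from the shapes involved. The sketch below uses a made-up 3x2 frame (an assumption for illustration) and checks that the result matches the div approach:

```python
import numpy as np
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [3.0, 2.0, 1.0]})

values = df.to_numpy()
row_sums_1d = values.sum(axis=1)                 # shape (3,)
row_sums_2d = values.sum(axis=1, keepdims=True)  # shape (3, 1)

# A (3, 1) array broadcasts across the columns of the (3, 2) frame,
# and a plain ndarray carries no labels for pandas to align on.
normalized = df / row_sums_2d

# Compare against the label-aligned div approach.
same = np.allclose(normalized, df.div(df.sum(axis=1), axis=0))
print(row_sums_1d.shape, row_sums_2d.shape, same)
```

Because the divisor has no index, this relies purely on position; it assumes the frame is entirely numeric.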

On a 1,000,000 x 100 DataFrame, this method takes about 40% less time (roughly 1.65x faster) than summing on the axis and dividing with div along the index:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1000000, 100))

%timeit -n 10 df.div(df.sum(axis=1), axis=0)
748 ms ± 14.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 df / df.to_numpy().sum(axis=1, keepdims=True)
452 ms ± 4.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In fact, this trend holds if we increase the number of rows and the number of columns:

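[Figure: perfplot timing comparison of the two methods, one plot for len(df)x10 DataFrames and one for 10_000xwidth(df) DataFrames]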

Code to reproduce the above plots:

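import perfplot
import pandas as pd
import numpy as np

# NumPy-based approach: divide by a (n_rows, 1) array of row sums
def enke(df):
    return df / df.to_numpy().sum(axis=1, keepdims=True)

# pandas-based approach: divide by the row sums, aligned on the index
def joris(df):
    return df.div(df.sum(axis=1), axis=0)

# Vary the number of rows, with the width fixed at 10 columns
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.rand(n, 10)),
    kernels=[enke, joris],
    labels=['enke', 'joris'],
    n_range=[2 ** k for k in range(4, 21)],
    equality_check=np.allclose,
    xlabel='~len(df)',
    title='For len(df)x10 DataFrames'
)

# Vary the number of columns, with the length fixed at 10,000 rows.
# n_range holds non-integer values (1.4 ** k), so cast n to int for the column count.
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.rand(10000, int(n))),
    kernels=[enke, joris],
    labels=['enke', 'joris'],
    n_range=[1.4 ** k for k in range(21)],
    equality_check=np.allclose,
    xlabel='~width(df)',
    title='For 10_000xwidth(df) DataFrames'
)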

answered Sep 19 '22 22:09

enke
