I'm reading through the Pandas documentation, and the term "broadcasting" is used extensively, but never really defined or explained. What does it mean?


What does the term "broadcasting" mean in Pandas documentation?


The term broadcasting comes from NumPy. Simply put, it describes the rules that determine the output when you perform operations between n-dimensional arrays (which could be Panels, DataFrames or Series) or scalar values.

Broadcasting using a scalar value

The simplest case is multiplying by a scalar value (the sessions below assume import numpy as np and import pandas as pd):

    In [4]: s = pd.Series(np.arange(5))
            s

    Out[4]:
    0    0
    1    1
    2    2
    3    3
    4    4
    dtype: int32

    In [5]: s * 10

    Out[5]:
    0     0
    1    10
    2    20
    3    30
    4    40
    dtype: int32

We get the same result, as expected, with a DataFrame:

    In [6]: df = pd.DataFrame({'a':np.random.randn(4), 'b':np.random.randn(4)})
            df

    Out[6]:
              a         b
    0  0.216920  0.652193
    1  0.968969  0.033369
    2  0.637784  0.856836
    3 -2.303556  0.426238

    In [7]: df * 10

    Out[7]:
               a         b
    0   2.169204  6.521925
    1   9.689690  0.333695
    2   6.377839  8.568362
    3 -23.035557  4.262381

What is technically happening here is that the scalar value has been broadcast along the same dimensions as the Series and DataFrame above.
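
To make the "stretching" of the scalar concrete, here is a minimal sketch that compares the broadcast result with one where the scalar has been written out as a full-length Series. The names broadcast_result and explicit_result are just illustrative, and pandas does not literally build the repeated Series; this only shows that the results agree.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5))

# Broadcasting: the scalar 10 is conceptually stretched to the length of s.
broadcast_result = s * 10

# The same computation with the stretched operand written out explicitly.
tens = pd.Series(np.full(len(s), 10, dtype=s.dtype), index=s.index)
explicit_result = s * tens

print(broadcast_result.equals(explicit_result))  # True
```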

Broadcasting using a 1-D array

Say we have a 2-D DataFrame of shape 4 x 3 (4 rows x 3 columns). We can perform an operation along each row using a 1-D Series whose length equals the length of a row (3 here, one value per column):

    In [8]: df = pd.DataFrame({'a':np.random.randn(4), 'b':np.random.randn(4), 'c':np.random.randn(4)})
            df

    Out[8]:
              a         b         c
    0  0.122073 -1.178127 -1.531254
    1  0.011346 -0.747583 -1.967079
    2 -0.019716 -0.235676  1.419547
    3  0.215847  1.112350  0.659432

    In [26]: df.iloc[0]

    Out[26]:
    a    0.122073
    b   -1.178127
    c   -1.531254
    Name: 0, dtype: float64

    In [27]: df + df.iloc[0]

    Out[27]:
              a         b         c
    0  0.244146 -2.356254 -3.062507
    1  0.133419 -1.925710 -3.498333
    2  0.102357 -1.413803 -0.111707
    3  0.337920 -0.065777 -0.871822

The above looks funny at first until you understand what is happening: I took the first row of values and added it to every row of the df. It can be visualised using this picture (sourced from scipy):

[Image: https://i.stack.imgur.com/kzNxo.gif]
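
The same idea can be spelled out in code. This is a rough sketch on a small made-up DataFrame (not the random one above) that repeats the first row once per row with np.tile and checks that the result matches the broadcast sum. The variable names are only for illustration.

```python
import numpy as np
import pandas as pd

# Small deterministic frame standing in for the random one in the answer.
df = pd.DataFrame(np.arange(12.0).reshape(4, 3), columns=["a", "b", "c"])
first_row = df.iloc[0]

# Broadcasting: the 3-element row is added to each of the 4 rows.
broadcast_sum = df + first_row

# Written out by hand: repeat the row 4 times to get a 4 x 3 block.
tiled = np.tile(first_row.to_numpy(), (len(df), 1))
explicit_sum = df + pd.DataFrame(tiled, index=df.index, columns=df.columns)

print(broadcast_sum.equals(explicit_sum))  # True
```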

The general rule is this:

In order to broadcast, the size of the trailing axes for both arrays in an operation must either be the same size or one of them must be one.
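
As a sketch of that rule, the hypothetical helper below (can_broadcast is my own name, not a NumPy function) walks two shapes from the trailing axis backwards and checks each pair of lengths. It only illustrates the rule as quoted; NumPy performs its own check internally.

```python
import numpy as np


def can_broadcast(shape_a, shape_b):
    """Return True if the two shapes are compatible under the trailing-axes rule."""
    # Compare axis lengths from the rightmost axis; zip stops at the shorter
    # shape, so missing leading axes are treated as compatible.
    for a, b in zip(reversed(shape_a), reversed(shape_b)):
        if a != b and a != 1 and b != 1:
            return False
    return True


print(can_broadcast((4, 3), (3,)))  # True
print(can_broadcast((4, 1), (3,)))  # True
print(can_broadcast((4, 3), (4,)))  # False

# Cross-check one case against NumPy itself.
print((np.zeros((4, 1)) + np.zeros(3)).shape)  # (4, 3)
```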

If I tried to add a 1-D array that didn't match in length, say one with 4 elements, NumPy would raise a ValueError. In pandas, because the Series labels (0 to 3) match none of the column labels, you'll get a df full of NaN values instead:

    In [30]: df + pd.Series(np.arange(4))

    Out[30]:
        a   b   c   0   1   2   3
    0 NaN NaN NaN NaN NaN NaN NaN
    1 NaN NaN NaN NaN NaN NaN NaN
    2 NaN NaN NaN NaN NaN NaN NaN
    3 NaN NaN NaN NaN NaN NaN NaN
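
If the intent was to add one value per row, one option is DataFrame.add with axis=0, which matches the Series against the row index instead of the columns. The sketch below uses a small made-up DataFrame rather than the random one above, and the names by_columns and by_rows are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12.0).reshape(4, 3), columns=["a", "b", "c"])
per_row = pd.Series(np.arange(4))

# Default behaviour: the Series labels 0..3 are matched against the column
# labels a, b, c. Nothing matches, so every cell is NaN.
by_columns = df + per_row

# axis=0 matches the Series labels against the row index instead, so each
# row gets its own value added.
by_rows = df.add(per_row, axis=0)

print(by_columns.isna().all().all())  # True
print(by_rows)
```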

One of the great things about pandas is that it will try to align using existing column names and row labels. This can get in the way when trying to perform a fancier kind of broadcasting like this:

[Image: https://i.stack.imgur.com/HcM6z.gif]

    In [55]: df[['a']] + df.iloc[0]

    Out[55]:
              a   b   c
    0  0.244146 NaN NaN
    1  0.133419 NaN NaN
    2  0.102357 NaN NaN
    3  0.337920 NaN NaN

In the above I use double subscripting to force the shape to be (4, 1), but there is a problem when trying to broadcast using the first row: the column alignment only aligns on the first column, so the other columns come out as NaN. To get the form of broadcasting that the diagram above shows, we have to decompose to NumPy arrays, which then become anonymous (unlabelled) data:

    In [56]: df[['a']].values + df.iloc[0].values

    Out[56]:
    array([[ 0.24414608, -1.05605392, -1.4091805 ],
           [ 0.13341899, -1.166781  , -1.51990758],
           [ 0.10235701, -1.19784299, -1.55096957],
           [ 0.33792013, -0.96227987, -1.31540645]])

It's also possible to broadcast in 3 dimensions. I don't go near that stuff often, but the numpy, scipy and pandas book have examples that show how that works.

Generally speaking, the thing to remember is that, aside from scalar values (which are simple), for n-D arrays the lengths of the minor/trailing axes must match or one of them must be 1.
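
For what it's worth, here is a small sketch of the trailing-axes rule in 3 dimensions using plain NumPy arrays. The shapes and names (stack, per_column, per_row) are made up for illustration.

```python
import numpy as np

stack = np.zeros((2, 4, 3))                # 2 tables of 4 rows x 3 columns
per_column = np.array([1.0, 2.0, 3.0])     # shape (3,)
per_row = np.arange(4.0).reshape(4, 1)     # shape (4, 1)

# (2, 4, 3) + (3,): the trailing axes match (3 and 3), so per_column is
# reused for every row of every table.
a = stack + per_column

# (2, 4, 3) + (4, 1): the trailing 1 stretches to 3, the next axes match
# (4 and 4), and the missing leading axis is filled in.
b = stack + per_row

print(a.shape, b.shape)  # (2, 4, 3) (2, 4, 3)
print(a[1, 2])           # [1. 2. 3.]
print(b[0, :, 0])        # [0. 1. 2. 3.]
```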

Update

It seems that the above now leads to ValueError: Unable to coerce to Series, length must be 1: given 3 in pandas 0.20.2 (the latest version at the time of this update), so you have to call .values on the df first:

    In[42]: df[['a']].values + df.iloc[0].values

    Out[42]:
    array([[ 0.244146, -1.056054, -1.409181],
           [ 0.133419, -1.166781, -1.519908],
           [ 0.102357, -1.197843, -1.55097 ],
           [ 0.33792 , -0.96228 , -1.315407]])

To turn this back into a df with the original column names, we can construct a df from the np array and pass the original columns to the constructor:

    In[43]: pd.DataFrame(df[['a']].values + df.iloc[0].values, columns=df.columns)

    Out[43]:
              a         b         c
    0  0.244146 -1.056054 -1.409181
    1  0.133419 -1.166781 -1.519908
    2  0.102357 -1.197843 -1.550970
    3  0.337920 -0.962280 -1.315407
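
If the df has a meaningful row index, it may be worth passing that back to the constructor as well, since the NumPy result carries no labels at all. A minimal sketch on a made-up DataFrame with string row labels, using to_numpy() (available in newer pandas versions) in place of .values:

```python
import pandas as pd

# Made-up frame with non-default row labels.
df = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0, 4.0],
     "b": [10.0, 20.0, 30.0, 40.0],
     "c": [100.0, 200.0, 300.0, 400.0]},
    index=["w", "x", "y", "z"],
)

column_a = df[["a"]].to_numpy()     # shape (4, 1)
first_row = df.iloc[0].to_numpy()   # shape (3,)

# (4, 1) + (3,) broadcasts to (4, 3); reattach both sets of labels.
result = pd.DataFrame(column_a + first_row, index=df.index, columns=df.columns)

print(result)
```

Without index=df.index the rows would be relabelled 0 to 3, which is fine for the example above but loses information here.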


answered Oct 05 '22 06:10

EdChum


Tags:

python

pandas

numpy

Dun Peal
