I'm reading through the Pandas documentation, and the term "broadcasting" is used extensively, but never really defined or explained.
What does it mean?
The term broadcasting comes from numpy. Simply put, it describes the rules that determine the output you get when you perform operations between n-dimensional arrays (these could be panels, dataframes or series) or scalar values.
So the simplest case is just multiplying by a scalar value:
In [4]: s = pd.Series(np.arange(5))
        s
Out[4]:
0    0
1    1
2    2
3    3
4    4
dtype: int32

In [5]: s * 10
Out[5]:
0     0
1    10
2    20
3    30
4    40
dtype: int32
and we get the same expected results with a dataframe:
In [6]: df = pd.DataFrame({'a':np.random.randn(4), 'b':np.random.randn(4)})
        df
Out[6]:
          a         b
0  0.216920  0.652193
1  0.968969  0.033369
2  0.637784  0.856836
3 -2.303556  0.426238

In [7]: df * 10
Out[7]:
           a         b
0   2.169204  6.521925
1   9.689690  0.333695
2   6.377839  8.568362
3 -23.035557  4.262381
So what is technically happening here is that the scalar value has been broadcast along the dimensions of the Series and DataFrame above.
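To make the scalar case concrete, here is a minimal sketch that stretches the scalar by hand with np.broadcast_to and compares the result with the plain s * 10. The names tens, implicit and explicit are only illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5))

# Stretch the scalar 10 to the shape of s explicitly (a read-only view, no copy).
tens = np.broadcast_to(10, s.shape)

# Multiplying by the scalar gives the same values as multiplying
# by the stretched array element by element.
implicit = s * 10
explicit = s * tens

print(implicit.tolist())   # [0, 10, 20, 30, 40]
print(explicit.tolist())   # [0, 10, 20, 30, 40]
```

In practice you never need to build the stretched array yourself; it is only shown here to illustrate what the scalar is treated as.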
Say we have a 2-D dataframe of shape 4 x 3 (4 rows x 3 columns). We can perform an operation along the x-axis by using a 1-D Series whose length equals the row length (3, the number of columns):
In [8]: df = pd.DataFrame({'a':np.random.randn(4), 'b':np.random.randn(4), 'c':np.random.randn(4)})
        df
Out[8]:
          a         b         c
0  0.122073 -1.178127 -1.531254
1  0.011346 -0.747583 -1.967079
2 -0.019716 -0.235676  1.419547
3  0.215847  1.112350  0.659432

In [26]: df.iloc[0]
Out[26]:
a    0.122073
b   -1.178127
c   -1.531254
Name: 0, dtype: float64

In [27]: df + df.iloc[0]
Out[27]:
          a         b         c
0  0.244146 -2.356254 -3.062507
1  0.133419 -1.925710 -3.498333
2  0.102357 -1.413803 -0.111707
3  0.337920 -0.065777 -0.871822
The above looks funny at first until you understand what is happening: I took the first row of values and added it to every row of the df. It can be visualised using this pic (sourced from scipy):

[Image: broadcasting diagram, sourced from scipy]
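The row-wise addition above can be checked by hand with a small sketch. It assumes fixed values instead of random ones so the arithmetic is easy to follow, and the names first_row, result and raw are only illustrative.

```python
import numpy as np
import pandas as pd

# Fixed values instead of random ones so the result is easy to check by hand.
df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                   'b': [10.0, 20.0, 30.0, 40.0],
                   'c': [100.0, 200.0, 300.0, 400.0]})

first_row = df.iloc[0]     # Series of length 3, labelled 'a', 'b', 'c'
result = df + first_row    # the row is added to every row of df

# The same arithmetic on the raw arrays: shapes (4, 3) and (3,) give (4, 3).
raw = df.values + first_row.values

print(result)
print(np.array_equal(result.values, raw))
```

Here the pandas result and the plain numpy result agree because the labels of first_row match the column labels of df exactly.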
The general rule is this:
In order to broadcast, the size of the trailing axes for both arrays in an operation must either be the same size or one of them must be one.
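The rule can be written out as a short check. This is only a sketch of the rule as stated; can_broadcast is a hypothetical helper, not a numpy or pandas function.

```python
from itertools import zip_longest

import numpy as np


def can_broadcast(shape_a, shape_b):
    """Return True if two shapes are compatible under the trailing-axes rule."""
    # Compare axis sizes from the last axis backwards;
    # an axis missing from the shorter shape counts as size 1.
    for size_a, size_b in zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1):
        if size_a != size_b and size_a != 1 and size_b != 1:
            return False
    return True


print(can_broadcast((4, 3), (3,)))   # True
print(can_broadcast((4, 1), (3,)))   # True
print(can_broadcast((4, 3), (4,)))   # False

# numpy agrees: the compatible pair (4, 1) and (3,) produces a (4, 3) result.
print((np.zeros((4, 1)) + np.zeros(3)).shape)   # (4, 3)
```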
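So if I tried to add a 1-D array that didn't match in length, say one with 4 elements, then unlike numpy, which will raise a ValueError, in Pandas you'll get a df full of NaN values. This is because the Series labels (0 to 3) do not match any of the column labels (a, b, c), so alignment finds no pair of values to add: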
In [30]: df + pd.Series(np.arange(4))
Out[30]:
    a   b   c   0   1   2   3
0 NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN
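The difference between the two libraries can be seen side by side in the sketch below. It assumes a df of ones so the numbers are easy to read; numpy_raised, misaligned and aligned are illustrative names. Depending on the pandas version, the misaligned addition may also emit a warning about sorting mixed labels.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((4, 3)), columns=['a', 'b', 'c'])

# numpy: shapes (4, 3) and (4,) are incompatible, so the addition raises.
try:
    df.values + np.arange(4)
    numpy_raised = False
except ValueError:
    numpy_raised = True

# pandas: the Series labels 0..3 do not match the columns 'a', 'b', 'c',
# so the result holds the union of the labels and every value is NaN.
misaligned = df + pd.Series(np.arange(4))

# A Series whose index matches the column labels aligns and broadcasts.
aligned = df + pd.Series([10, 20, 30], index=['a', 'b', 'c'])

print(numpy_raised)
print(misaligned.shape)
print(aligned)
```

So what matters to pandas here is the labels rather than the length: a Series of the right length but with the wrong labels would still produce NaN values.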
Now one of the great things about pandas is that it will try to align using existing column names and row labels. This can get in the way of trying to perform a fancier broadcasting like this:
In [55]: df[['a']] + df.iloc[0]
Out[55]:
          a   b   c
0  0.244146 NaN NaN
1  0.133419 NaN NaN
2  0.102357 NaN NaN
3  0.337920 NaN NaN
In the above I use double subscripting to force the shape to be (4,1), but we hit a problem when trying to broadcast using the first row: the column alignment only aligns on the first column. To get the same form of broadcasting as the diagram above shows, we have to decompose to numpy arrays, which then become anonymous data with no labels to align on:
In [56]: df[['a']].values + df.iloc[0].values
Out[56]:
array([[ 0.24414608, -1.05605392, -1.4091805 ],
       [ 0.13341899, -1.166781  , -1.51990758],
       [ 0.10235701, -1.19784299, -1.55096957],
       [ 0.33792013, -0.96227987, -1.31540645]])
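The shapes involved are easier to see with fixed numbers. The following is a small sketch assuming hand-picked values; col, row and out are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                   'b': [10.0, 20.0, 30.0, 40.0],
                   'c': [100.0, 200.0, 300.0, 400.0]})

col = df[['a']].values     # shape (4, 1): double brackets keep the column 2-D
row = df.iloc[0].values    # shape (3,)

# (4, 1) and (3,) broadcast to (4, 3): every element of column 'a'
# is paired with every element of the first row.
out = col + row

print(col.shape, row.shape, out.shape)   # (4, 1) (3,) (4, 3)
print(out)
```

Each output row i is the value of column 'a' in row i added to the whole first row, which is the behaviour the labelled version could not produce.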
It's also possible to broadcast in 3 dimensions. I don't go near that stuff often, but the numpy and scipy docs and the pandas book have examples that show how it works.
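As a rough illustration of the 3-D case, here is a sketch using plain numpy arrays. The shapes and the names cube, row and col are made up for the example.

```python
import numpy as np

cube = np.zeros((2, 4, 3))           # 2 blocks, each 4 rows x 3 columns

row = np.array([1.0, 2.0, 3.0])      # shape (3,)
col = np.arange(4.0).reshape(4, 1)   # shape (4, 1)

# (2, 4, 3) with (3,): the row is added to every row of every block.
by_row = cube + row

# (2, 4, 3) with (4, 1): the column is added to every column of every block.
by_col = cube + col

print(by_row.shape, by_col.shape)    # (2, 4, 3) (2, 4, 3)
```

In both cases the trailing axes either match or are 1, so the smaller array is stretched across the leading axis as well.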
Generally speaking, the thing to remember is that, aside from scalar values, which are simple, for n-D arrays the lengths of the minor/trailing axes must match or one of them must be 1.
Update
It seems that the above now leads to ValueError: Unable to coerce to Series, length must be 1: given 3 in pandas 0.20.2 (the latest version at the time of this update), so you have to call .values on the df first:
In[42]: df[['a']].values + df.iloc[0].values
Out[42]:
array([[ 0.244146, -1.056054, -1.409181],
       [ 0.133419, -1.166781, -1.519908],
       [ 0.102357, -1.197843, -1.55097 ],
       [ 0.33792 , -0.96228 , -1.315407]])
To get this back into a df with the original column names, we can construct a df from the np array and pass the original columns to the constructor:
In[43]: pd.DataFrame(df[['a']].values + df.iloc[0].values, columns=df.columns)
Out[43]:
          a         b         c
0  0.244146 -1.056054 -1.409181
1  0.133419 -1.166781 -1.519908
2  0.102357 -1.197843 -1.550970
3  0.337920 -0.962280 -1.315407
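If the df has meaningful row labels, they are lost in the round trip through numpy as well. One way to keep them, sketched here with a hypothetical index of 'x' and 'y', is to pass index=df.index alongside the columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0], 'b': [10.0, 20.0], 'c': [100.0, 200.0]},
                  index=['x', 'y'])   # hypothetical non-default row labels

arr = df[['a']].values + df.iloc[0].values   # plain ndarray, labels dropped

# Pass both the columns and the index so the result lines up with df again.
restored = pd.DataFrame(arr, columns=df.columns, index=df.index)
print(restored)
```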