I came across something curious (to me) while trying to answer this question.
Say I want to compare a series of shape (10,) to a df of shape (10,10):
import numpy as np
import pandas as pd
np.random.seed(0)
my_ser = pd.Series(np.random.randint(0, 100, size=10))
my_df = pd.DataFrame(np.random.randint(0, 100, size=100).reshape(10,10))
my_ser > 10 * my_df
yields, as expected, a matrix of the shape of the df (10,10). The comparison seems to be row-wise.
However consider this case:
df = pd.DataFrame({'cell1':[0.006209, 0.344955, 0.004521, 0, 0.018931, 0.439725, 0.013195, 0.009045, 0, 0.02614, 0],
'cell2':[0.048043, 0.001077, 0,0.010393, 0.031546, 0.287264, 0.016732, 0.030291, 0.016236, 0.310639,0],
'cell3':[0,0,0.020238, 0, 0.03811, 0.579348, 0.005906, 0,0,0.068352, 0.030165],
'cell4':[0.016139, 0.009359, 0,0,0.025449, 0.47779, 0, 0.01282, 0.005107, 0.004846, 0],
'cell5': [0,0,0,0.012075, 0.031668, 0.520258, 0,0,0,2.728218, 0.013418]})
i = 0
df.iloc[:,i].shape
>(11,)
(10 * df.drop(df.columns[i], axis=1)).shape
>(11,4)
(df.iloc[:,i] > (10 * df.drop(df.columns[i], axis=1))).shape
>(11,15)
As far as I can tell, here Pandas broadcasts the Series with the df. Why is this?
The desired behaviour can be obtained with:
(10 * df.drop(df.columns[i], axis=1)).lt(df.iloc[:,i], axis=0).shape
>(11,4)
pd.__version__
'0.24.0'
Pandas is widely used for financial time series and economics data (it has a lot of built-in helpers for handling financial data). NumPy is a fast way to handle large multidimensional arrays for scientific computing (SciPy also helps).
The term broadcasting comes from NumPy. Simply put, it describes the rules that determine the shape of the output when you perform operations between n-dimensional arrays (which could be panels, DataFrames or Series) or scalar values.
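As a minimal sketch of NumPy-style broadcasting (the arrays `row` and `grid` are made up for illustration and do not come from the question), a (3,) array compared with a (2, 3) array is stretched along the first axis. No labels are involved, only shapes:

```python
import numpy as np

row = np.array([1, 2, 3])              # shape (3,)
grid = np.array([[0, 2, 4],
                 [1, 1, 1]])           # shape (2, 3)

# The (3,) array is broadcast to (2, 3), so every row of grid is
# compared element by element with row.
result = row > grid
print(result.shape)
print(result)
```

This purely positional matching is what plain NumPy does; pandas adds label alignment on top of it, which is what the rest of this answer is about.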
A pandas Series is a one-dimensional labelled data structure which can hold data such as strings, integers and even other Python objects. It is built on top of a NumPy array and is the primary data structure for one-dimensional data in pandas.
What is happening here is pandas' intrinsic data alignment. Pandas almost always aligns data on labels, either the row index or the column headers. Here is a quick example:
s1 = pd.Series([1,2,3], index=['a','b','c'])
s2 = pd.Series([2,4,6], index=['a','b','c'])
s1 + s2
#Output as expected:
a 3
b 6
c 9
dtype: int64
Now, let's run a couple other examples with different indexing:
s2 = pd.Series([2,4,6], index=['a','a','c'])
s1 + s2
#Output
a 3.0
a 5.0
b NaN
c 9.0
dtype: float64
With duplicated index labels you get a cartesian product of the matching labels (here 'a' matches twice), and a label that exists on only one side gives value + NaN = NaN.
And, no matching indexes:
s2 = pd.Series([2,4,6], index=['e','f','g'])
s1 + s2
#Output
a NaN
b NaN
c NaN
e NaN
f NaN
g NaN
dtype: float64
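To look at the alignment step on its own, one option is `Series.align`, which returns both operands reindexed to the union of their labels without doing any arithmetic. The sketch below reuses `s1` and `s2` from the last example; the names `left`, `right`, `total` and `filled` are just illustrative:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([2, 4, 6], index=['e', 'f', 'g'])

# align() reindexes both Series to the union of their labels (outer join
# by default) and fills the gaps with NaN.
left, right = s1.align(s2)
print(left.tolist())    # s1's values, then NaN for 'e', 'f', 'g'
print(right.tolist())   # NaN for 'a', 'b', 'c', then s2's values

# Adding the aligned Series reproduces the all-NaN result of s1 + s2.
total = left + right
print(total.isna().all())

# add() with fill_value treats a label missing on one side as 0 instead.
filled = s1.add(s2, fill_value=0)
print(filled.tolist())
```

Every label has a NaN on one side after alignment, which is why the sum above is NaN everywhere.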
So, in your first example you are creating a pd.Series and a pd.DataFrame with default range indexes, so the Series index (0..9) matches the DataFrame's column labels (0..9), hence the comparison happens as expected. In your second example, you are comparing the column headers ['cell2','cell3','cell4','cell5'] with the Series' default range index (0..10). No labels match, so the result carries all 15 labels as columns (4 + 11), and every value is False because a comparison with NaN returns False.
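The second case can be sketched in miniature with an explicit `DataFrame.align` call, so the intermediate frames are visible. This is an illustration under assumptions, not the question's data: the values are made up, and string row labels (`r0`..`r2`) stand in for the default range index so that all labels are of one type. Aligning explicitly also sidesteps the fact that automatic alignment in DataFrame vs Series comparisons may warn or raise in newer pandas releases than the 0.24.0 used in the question.

```python
import pandas as pd

# Hypothetical miniature: row labels and column labels share nothing.
rows = ['r0', 'r1', 'r2']
df = pd.DataFrame({'cell2': [0.1, 0.2, 0.3],
                   'cell3': [0.4, 0.5, 0.6]}, index=rows)
ser = pd.Series([1.0, 2.0, 3.0], index=rows)

# Align the Series index against the DataFrame *columns* (axis=1).
left, right = df.align(ser, axis=1)
print(list(left.columns))   # union of the column labels and the Series labels

# Every position now has a NaN on one side, so the comparison is all False.
mask = left < right
print(mask.shape, mask.to_numpy().any())

# Matching on the row index instead (axis=0) keeps the original columns.
row_wise = df.lt(ser, axis=0)
print(row_wise.shape)
```

The `axis=0` form is the same idea as the `.lt(..., axis=0)` call in the question: it tells pandas to match the Series labels against the row index rather than the column headers.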
Bottom line, pandas compares each series value to the column whose title matches that value's index label. The index in your second example is 0..10 and the column names are cell2..cell5, so no column name matches, and you just append new columns. This is essentially treating the series as a dataframe with the index as the column titles.
You can actually see part of what pandas does in your first example if you make your series longer than the number of columns:
>>> my_ser = pd.Series(np.random.randint(0, 100, size=20))
>>> my_df
0 1 2 3 4
0 9 10 27 45 71
1 39 61 85 97 44
2 34 34 88 33 5
3 36 0 75 34 69
4 53 80 62 8 61
5 1 81 35 91 40
6 36 48 25 67 35
7 30 29 33 18 17
8 93 84 2 69 12
9 44 66 91 85 39
>>> my_ser
0 92
1 36
2 25
3 32
4 42
5 14
6 86
7 28
8 20
9 82
10 68
11 22
12 99
13 83
14 7
15 72
16 61
17 13
18 5
19 0
dtype: int64
>>> my_ser>my_df
0 1 2 3 4 5 6 7 8 9 \
0 True True False False False False False False False False
1 True False False False False False False False False False
2 True True False False True False False False False False
3 True True False False False False False False False False
4 True False False True False False False False False False
5 True False False False True False False False False False
6 True False False False True False False False False False
7 True True False True True False False False False False
8 False False True False True False False False False False
9 True False False False True False False False False False
10 11 12 13 14 15 16 17 18 19
0 False False False False False False False False False False
1 False False False False False False False False False False
2 False False False False False False False False False False
3 False False False False False False False False False False
4 False False False False False False False False False False
5 False False False False False False False False False False
6 False False False False False False False False False False
7 False False False False False False False False False False
8 False False False False False False False False False False
9 False False False False False False False False False False
Note what is happening: 92 is compared to the first column, so you get a single False at 93. Then 36 is compared to the second column, and so on. If the length of your series matches the number of columns, then you get the expected behavior.
But what happens when your series is longer? Well, you need to append a new fake column to the data frame to continue the comparison. What is it filled with? I found no documentation, but my impression is that those positions just come out False, since there is nothing to compare to (a missing value is NaN, and comparisons with NaN return False). Hence you get extra columns to match the series length, all False.
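One way to probe what such extra columns contain is to pad a small frame by hand. The sketch below assumes the padding done during alignment behaves like `reindex`, which inserts NaN for labels that were not there before; `small` and `padded` are made-up names for illustration:

```python
import numpy as np
import pandas as pd

small = pd.DataFrame([[1, 2], [3, 4]])       # columns 0 and 1
padded = small.reindex(columns=range(4))     # columns 2 and 3 hold only NaN
print(padded)

# Ordering and equality comparisons with NaN are False.
print(np.nan > 5, np.nan < 5, np.nan == np.nan)

# So the padded columns come out False whichever way the comparison points.
print((padded < 10).to_numpy().tolist())
print((padded > 0).to_numpy().tolist())
```

Under that assumption, the all-False columns are not a special fill value but the ordinary result of comparing against NaN.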
But what about your example? You do not get 11 columns, but 4+11=15! Let's make another test:
>>> my_df = pd.DataFrame(np.random.randint(0, 100, size=100).reshape(10,10), columns=[chr(ord('a') + i) for i in range(10)])
>>> my_ser = pd.Series(np.random.randint(0, 100, size=10))
>>> (my_df>my_ser).shape
(10, 20)
This time we got the sum of the dimensions, 10+10=20, as the number of output columns!
What was the difference? Pandas compares each series index label with the matching column title. In your first example, the index of my_ser and the titles of my_df matched, so it compared them. If there are extra labels, the above is what happens. If all columns have different names than the series index labels, then all the columns are extra, and you get your result. That is also what happens in my example, where the titles are now characters and the index is made of integers.
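The column counts seen above (20, 15) are consistent with the output carrying the union of the column titles and the series index labels. A rough sketch of that bookkeeping follows; `aligned_width` is a hypothetical helper written for this illustration, not a pandas function, and the small frames use string labels throughout:

```python
import pandas as pd

ser = pd.Series([10, 20, 30, 40], index=['w', 'x', 'y', 'z'])
df_overlap = pd.DataFrame([[1, 2, 3]], columns=['w', 'x', 'y'])
df_disjoint = pd.DataFrame([[1, 2, 3]], columns=['a', 'b', 'c'])

def aligned_width(df, ser):
    """Hypothetical helper: number of columns after aligning ser with df's columns."""
    return len(df.columns.union(ser.index))

print(aligned_width(df_overlap, ser))    # labels w, x, y, z
print(aligned_width(df_disjoint, ser))   # labels a, b, c, w, x, y, z

# Cross-check against an explicit alignment on the columns.
left, _ = df_disjoint.align(ser, axis=1)
print(left.shape[1] == aligned_width(df_disjoint, ser))
```

With partial overlap only the unmatched label adds a column; with no overlap the widths simply add up, as in the 10+10=20 and 4+11=15 cases.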