When does Pandas default to broadcasting Series and Dataframes?

I came across something curious (to me) while trying to answer this question.

Say I want to compare a series of shape (10,) to a df of shape (10,10):

import numpy as np
import pandas as pd

np.random.seed(0)
my_ser = pd.Series(np.random.randint(0, 100, size=10))
my_df = pd.DataFrame(np.random.randint(0, 100, size=100).reshape(10,10))
my_ser > 10 * my_df

yields, as expected, a matrix of the shape of the df (10,10). The comparison seems to be row-wise.

However consider this case:

df = pd.DataFrame({'cell1': [0.006209, 0.344955, 0.004521, 0, 0.018931, 0.439725, 0.013195, 0.009045, 0, 0.02614, 0],
                   'cell2': [0.048043, 0.001077, 0, 0.010393, 0.031546, 0.287264, 0.016732, 0.030291, 0.016236, 0.310639, 0],
                   'cell3': [0, 0, 0.020238, 0, 0.03811, 0.579348, 0.005906, 0, 0, 0.068352, 0.030165],
                   'cell4': [0.016139, 0.009359, 0, 0, 0.025449, 0.47779, 0, 0.01282, 0.005107, 0.004846, 0],
                   'cell5': [0, 0, 0, 0.012075, 0.031668, 0.520258, 0, 0, 0, 2.728218, 0.013418]})
i = 0
df.iloc[:,i].shape
>(11,)
(10 * df.drop(df.columns[i], axis=1)).shape
>(11,4)
(df.iloc[:,i] > (10 * df.drop(df.columns[i], axis=1))).shape
>(11,15)

As far as I can tell, here Pandas broadcasts the Series with the df. Why is this?

The desired behaviour can be obtained with:

(10 * df.drop(df.columns[i], axis=1)).lt(df.iloc[:,i], axis=0).shape
>(11,4)

pd.__version__
'0.24.0'

asked Feb 17 '19 08:02

Josh Friedlander

2 Answers

What is happening is that pandas is using its intrinsic data alignment. Pandas almost always aligns data on labels, either the row index or the column headers. Here is a quick example:

s1 = pd.Series([1,2,3], index=['a','b','c'])
s2 = pd.Series([2,4,6], index=['a','b','c'])
s1 + s2
#Output as expected:
a    3
b    6
c    9
dtype: int64

Now, let's run a couple of other examples with different indexing:

s2 = pd.Series([2,4,6], index=['a','a','c'])
s1 + s2
#Output
a    3.0
a    5.0
b    NaN
c    9.0
dtype: float64

A Cartesian product happens for duplicated index labels, and a label that exists in only one Series gives NaN, because NaN + value = NaN.
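The Cartesian-product behaviour is easier to see when both sides contain a repeated label. A minimal sketch (the Series `left` and `right` are made up for illustration and are not from the question):

```python
import pandas as pd

# Hypothetical data: the label 'a' appears twice on each side
left = pd.Series([1, 2, 3], index=['a', 'a', 'b'])
right = pd.Series([10, 20, 30], index=['a', 'a', 'c'])

total = left + right

# Each 'a' on the left pairs with each 'a' on the right: 2 x 2 = 4 rows.
# 'b' and 'c' exist on one side only, so they come out as NaN.
print(total)
print(len(total))  # 4 rows for 'a' plus one each for 'b' and 'c'
```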

And, no matching indexes:

s2 = pd.Series([2,4,6], index=['e','f','g'])
s1 + s2
#Output
a   NaN
b   NaN
c   NaN
e   NaN
f   NaN
g   NaN
dtype: float64
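
If you want to look at the alignment step on its own, `Series.align` returns both operands reindexed to the union of their labels. This is only a sketch reusing the `s1` and `s2` values from above; `filled` is a name I made up, and `fill_value=0` is just one possible choice for handling labels that exist on one side only:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([2, 4, 6], index=['e', 'f', 'g'])

# align() does an outer join on the labels by default
s1_al, s2_al = s1.align(s2)
print(s1_al.index.tolist())  # ['a', 'b', 'c', 'e', 'f', 'g']

# Every label is missing on one side, so plain addition is all NaN
print((s1 + s2).isna().all())  # True

# The flex method add() can substitute a value for the missing side
filled = s1.add(s2, fill_value=0)
print(filled.tolist())  # [1.0, 2.0, 3.0, 2.0, 4.0, 6.0]
```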

So, in your first example you are creating a pd.Series and a pd.DataFrame with default range indexes that match, hence the comparison happens as expected. In your second example, you are comparing the column headers ['cell2','cell3','cell4','cell5'] with the Series' default range index (0 to 10). None of the labels match, so the result contains all 15 columns (4 + 11), and every value is False because a comparison involving NaN returns False.
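
To make the two alignment directions concrete, here is a small sketch with made-up labels (`df`, `by_col`, and `by_row` are hypothetical). It uses `+` rather than `>` because, as far as I know, the way the bare comparison operators treat misaligned labels has changed between pandas versions, whereas arithmetic and the flex methods align by label:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]},
                  index=['r0', 'r1', 'r2'])
by_col = pd.Series([100, 200], index=['x', 'y'])          # labels match the columns
by_row = pd.Series([1, 2, 3], index=['r0', 'r1', 'r2'])   # labels match the rows

# Default: the Series index is aligned with the DataFrame's columns
out_cols = df + by_col
print(out_cols.shape)  # (3, 2)

# Row-labelled Series with the default axis: no column label matches,
# so the result has the union of the labels and every cell is NaN
out_mismatch = df + by_row
print(out_mismatch.shape)  # (3, 5)

# Flex method with axis=0: the Series index is aligned with the row index
out_rows = df.add(by_row, axis=0)
print(out_rows['x'].tolist())  # [2, 4, 6]
```

The same `axis=0` idea is what makes the `.lt(..., axis=0)` call in the question return the (11,4) shape.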

answered Oct 20 '22 00:10

Scott Boston

Bottom line: Pandas compares each series value to the column whose title matches that value's index label. The indices in your second example are 0..10 and the column names are cell2..cell5, so no column name matches, and new columns are simply appended. This essentially treats the series as a dataframe with the index as the column titles.

You can actually see part of what pandas does in your first example if you make your series longer than the number of columns:

>>> my_ser = pd.Series(np.random.randint(0, 100, size=20))
>>> my_df
    0   1   2   3   4
0   9  10  27  45  71
1  39  61  85  97  44
2  34  34  88  33   5
3  36   0  75  34  69
4  53  80  62   8  61
5   1  81  35  91  40
6  36  48  25  67  35
7  30  29  33  18  17
8  93  84   2  69  12
9  44  66  91  85  39
>>> my_ser
0     92
1     36
2     25
3     32
4     42
5     14
6     86
7     28
8     20
9     82
10    68
11    22
12    99
13    83
14     7
15    72
16    61
17    13
18     5
19     0
dtype: int64
>>> my_ser>my_df
      0      1      2      3      4      5      6      7      8      9   \
0   True   True  False  False  False  False  False  False  False  False
1   True  False  False  False  False  False  False  False  False  False
2   True   True  False  False   True  False  False  False  False  False
3   True   True  False  False  False  False  False  False  False  False
4   True  False  False   True  False  False  False  False  False  False
5   True  False  False  False   True  False  False  False  False  False
6   True  False  False  False   True  False  False  False  False  False
7   True   True  False   True   True  False  False  False  False  False
8  False  False   True  False   True  False  False  False  False  False
9   True  False  False  False   True  False  False  False  False  False

      10     11     12     13     14     15     16     17     18     19
0  False  False  False  False  False  False  False  False  False  False
1  False  False  False  False  False  False  False  False  False  False
2  False  False  False  False  False  False  False  False  False  False
3  False  False  False  False  False  False  False  False  False  False
4  False  False  False  False  False  False  False  False  False  False
5  False  False  False  False  False  False  False  False  False  False
6  False  False  False  False  False  False  False  False  False  False
7  False  False  False  False  False  False  False  False  False  False
8  False  False  False  False  False  False  False  False  False  False
9  False  False  False  False  False  False  False  False  False  False

Note what is happening: 92 is compared to the first column, so you get a single False at 93. Then 36 is compared to the second column, and so on. If the length of your series matches the number of columns, you get the expected behavior.

But what happens when your series is longer? Well, a new fake column has to be appended to the data frame to continue the comparison. What is it filled with? I found no documentation, but my impression is that the result is simply filled with False, since there is nothing to compare to. Hence you get extra columns to match the series length, all False.
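
The filler can be inspected by doing the alignment explicitly. A minimal sketch, assuming a small hypothetical frame `df` and series `ser`; it uses `DataFrame.align` and the flex method `gt` instead of the `>` operator:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 5], 'b': [7, 2]})
ser = pd.Series([3, 3, 3], index=['a', 'b', 'c'])  # 'c' has no matching column

# Align the Series index against the columns (outer join by default)
df_al, ser_al = df.align(ser, axis=1)
print(df_al.columns.tolist())    # ['a', 'b', 'c']
print(df_al['c'].isna().all())   # True: the added column is NaN

# Any ordering comparison against NaN evaluates to False
print(np.nan > 3)                # False

# So the extra column comes out as all False
result = df.gt(ser)
print(result['c'].tolist())      # [False, False]
```

In this sketch the added column holds NaN after alignment, and the False values come from comparing NaN with the series value.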

But what about your example? You do not get 11 columns, but 4+11=15! Let's run another test:

>>> my_df = pd.DataFrame(np.random.randint(0, 100, size=100).reshape(10,10),columns=[chr(i) for i in range(10)])
>>> my_ser = pd.Series(np.random.randint(0, 100, size=10))
>>> (my_df>my_ser).shape
(10, 20)

This time we got the sum of the dimensions, 10+10=20, as the number of output columns!

What was the difference? Pandas compares each series index label with the matching column title. In your first example, the index of my_ser and the column titles of my_df matched, so it compared them. If the series has labels with no matching column, extra columns are added as shown above. If all the column names differ from the series indices, then every label is extra, which gives your result; the same happens in my example, where the titles are now characters and the index labels are integers.
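
Put differently, the number of output columns should equal the size of the union of the column titles and the series index. A rough check of that rule, with hypothetical labels and the flex method `gt` standing in for `>`:

```python
import pandas as pd

df = pd.DataFrame(0, index=range(3), columns=['a', 'b', 'c', 'd'])

label_sets = [
    ['a', 'b', 'c', 'd'],  # all labels match the columns
    ['c', 'd', 'e', 'f'],  # two labels match
    ['w', 'x', 'y', 'z'],  # no label matches
]

widths = []
for labels in label_sets:
    ser = pd.Series(1, index=labels)
    result = df.gt(ser)
    # Output width equals the number of distinct labels on either side
    assert result.shape[1] == len(df.columns.union(ser.index))
    widths.append(result.shape[1])

print(widths)  # [4, 6, 8]
```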

answered Oct 19 '22 23:10

kabanus

Tags: python, pandas, array-broadcasting
