Pandas: Knowing when an operation affects the original dataframe

Tags:

I love pandas and have been using it for years and feel pretty confident I have a good handle on how to subset dataframes and deal with views vs copies appropriately (though I use a lot of assertions to be sure). I also know that there have been tons of questions about SettingWithCopyWarning, e.g. How to deal with SettingWithCopyWarning in Pandas? and some great recent guides on wrapping your head around when it happens, e.g. Understanding SettingWithCopyWarning in pandas.

But I also know specific things like the quote from this answer are no longer in the most recent docs (0.22.0) and that many things have been deprecated over the years (leading to some inappropriate old SO answers), and that things are continuing to change.

Recently after teaching pandas to complete newcomers with very basic general Python knowledge about things like avoiding chained-indexing (and using .iloc/.loc), I've still struggled to provide general rules of thumb to know when it's important to pay attention to the SettingWithCopyWarning (e.g. when it's safe to ignore it).

I've personally found that the specific pattern of subsetting a dataframe according so some rule (e.g. slicing or boolean operation) and then modifying that subset, independent of the original dataframe, is a much more common operation than the docs suggest. In this situation we want to modify the copy not the original and the warning is confusing/scary to newcomers.

I know it's not trivial to know ahead of time when a view vs a copy is returned, e.g.
What rules does Pandas use to generate a view vs a copy?
Checking whether data frame is copy or view in Pandas

So instead I'm looking for the answer to a more general (beginner friendly) question: when does performing an operation on a subsetted dataframe affect the original dataframe from which it was created, and when are they independent?.

I've created some cases below that I think seem reasonable, but I'm not sure if there's a "gotcha" I'm missing or if there's any easier way to think/check this. I was hoping someone could confirm that my intuitions about the following use cases are correct as the pertain to my question above.

import pandas as pd df1 = pd.DataFrame({'A':[2,4,6,8,10],'B':[1,3,5,7,9],'C':[10,20,30,40,50]})

1) Warning: No
Original changed: No

# df1 will be unaffected because we use .copy() method explicitly  df2 = df1.copy() # # Reference: docs df2.iloc[0,1] = 100

2) Warning: Yes (I don't really understood why)
Original changed: No

# df1 will be unaffected because .query() always returns a copy # # Reference: # https://stackoverflow.com/a/23296545/8022335 df2 = df1.query('A < 10') df2.iloc[0,1] = 100

3) Warning: Yes
Original changed: No

# df1 will be unaffected because boolean indexing with .loc # always returns a copy # # Reference: # https://stackoverflow.com/a/17961468/8022335 df2 = df1.loc[df1['A'] < 10,:] df2.iloc[0,1] = 100

4) Warning: No
Original changed: No

# df1 will be unaffected because list indexing with .loc (or .iloc) # always returns a copy # # Reference: # Same as 4) df2 = df1.loc[[0,3,4],:] df2.iloc[0,1] = 100

5) Warning: No
Original changed: Yes (confusing to newcomers but makes sense)

# df1 will be affected because scalar/slice indexing with .iloc/.loc # always references the original dataframe, but may sometimes  # provide a view and sometimes provide a copy # # Reference: docs df2 = df1.loc[:10,:] df2.iloc[0,1] = 100

tl;dr When creating a new dataframe from the original, changing the new dataframe:
Will change the original when scalar/slice indexing with .loc/.iloc is used to create the new dataframe.
Will not change the original when boolean indexing with .loc, .query(), or .copy() is used to create the new dataframe

948

asked Jan 09 '18 17:01

ejolly

1 Answers

This is a somewhat confusing and even frustrating part of pandas, but for the most part you shouldn't really have to worry about this if you follow some simple workflow rules. In particular, note that there are only two general cases here when you have two dataframes, with one being a subset of the other.

This is a case where the Zen of Python rule "explicit is better than implicit" is a great guideline to follow.

Case A: Changes to `df2` should NOT affect `df1`

This is trivial, of course. You want two completely independent dataframes so you just explicitly make a copy:

df2 = df1.copy()

After this anything you do to df2 affects only df2 and not df1 and vice versa.

Case B: Changes to `df2` should ALSO affect `df1`

In this case I don't think there is one general way to solve the problem because it depends on exactly what you're trying to do. However, there are a couple of standard approaches that are pretty straightforward and should not have any ambiguity about how they are working.

Method 1: Copy df1 to df2, then use df2 to update df1

In this case, you can basically do a one to one conversion of the examples above. Here's example #2:

df2 = df1.copy() df2 = df1.query('A < 10') df2.iloc[0,1] = 100  df1 = df2.append(df1).reset_index().drop_duplicates(subset='index').drop(columns='index')

Unfortunately the re-merging via append is a bit verbose there. You can do it more cleanly with the following, although it has the side effect of converting integers to floats.

df1.update(df2)   # note that this is an inplace operation

Method 2: Use a mask (don't create df2 at all)

I think the best general approach here is not to create df2 at all, but rather have it be a masked version of df1. Somewhat unfortunately, you can't do a direct translation of the above code due to its mixing of loc and iloc which is fine for this example though probably unrealistic for actual use.

The advantage is that you can write very simple and readable code. Here's an alternative version of example #2 above where df2 is actually just a masked version of df1. But instead of changing via iloc, I'll change if column "C" == 10.

df2_mask = df1['A'] < 10 df1.loc[ df2_mask & (df1['C'] == 10), 'B'] = 100

Now if you print df1 or df1[df2_mask] you will see that column "B" = 100 for the first row of each dataframe. Obviously this is not very surprising here, but that's the inherent advantage of following "explicit is better than implicit".

111

answered Oct 14 '22 02:10

JohnE

Related questions
                            
                                Python-Requests close http connection
                            
                                Extracting a region from an image using slicing in Python, OpenCV
                            
                                Reading a text file and splitting it into single words in python
                            
                                How to save requests (python) cookies to a file?
                            
                                how to get access to error message from abort command when using custom error handler
                            
                                Removing duplicate characters from a string
                            
                                How to do gradient clipping in pytorch?
                            
                                Open web in new tab Selenium + Python
                            
                                requests: how to disable / bypass proxy
                            
                                Hidden features of PyCharm [closed]
                            
                                Can I use Django F() objects with string concatenation?
                            
                                PyQt4.QtCore.pyqtSignal object has no attribute 'connect'
                            
                                What's an efficient way to find if a point lies in the convex hull of a point cloud?
                            
                                Python: SWIG vs ctypes
                            
                                How to split a string of space separated numbers into integers?
                            
                                Django, template context processors
                            
                                How to format elapsed time from seconds to hours, minutes, seconds and milliseconds in Python?
                            
                                ubuntu /usr/bin/env: python: No such file or directory
                            
                                How to remove any URL within a string in Python
                            
                                Django: Converting an entire set of a Model's objects into a single dictionary

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Pandas: Knowing when an operation affects the original dataframe

Tags:

python

pandas

chained-assignment

views

ejolly

People also ask

1 Answers

Case A: Changes to `df2` should NOT affect `df1`

Case B: Changes to `df2` should ALSO affect `df1`

JohnE

Recent Activity

Donate For Us

Pandas: Knowing when an operation affects the original dataframe

Tags:

python

pandas

chained-assignment

views

ejolly

People also ask

1 Answers

Case A: Changes to df2 should NOT affect df1

Case B: Changes to df2 should ALSO affect df1

JohnE

Related questions

Recent Activity

Donate For Us

Case A: Changes to `df2` should NOT affect `df1`

Case B: Changes to `df2` should ALSO affect `df1`