Unpredictable pandas slice assignment behavior with no SettingWithCopyWarning

It's well known (and understandable) that pandas behavior is essentially unpredictable when assigning to a slice. But I'm used to being warned about it by a SettingWithCopyWarning.

Why is the warning not generated in either of the following two code snippets, and what techniques could reduce the chance of writing such code unintentionally?

# pandas 0.18.1, python 3.5.1
import pandas as pd
data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
new_data = data[['a', 'b']]
data = data['a']
new_data.loc[0, 'a'] = 100 # no warning, doesn't propagate to data

data[0] == 1
True


data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
new_data = data['a']
data = data['a']
new_data.loc[0] = 100 # no warning, propagates to data

data[0] == 100
True

I thought the explanation was that pandas only produces the warning when the parent DataFrame is still reachable from the current context. (This would be a weakness of the detection algorithm, as my previous examples show.)

In the next snippet, AFAIK the original two-column DataFrame is no longer reachable, and yet the pandas warning mechanism still manages to trigger (luckily):

data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
new_data = data['a']
data = data[['a']]
new_data.loc[0] = 100 # warning, so we're safe

Edit:

While investigating this, I found another case of a missing warning:

data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
data = data.groupby('a')
new_data = data.filter(lambda g: len(g)==1)
new_data.loc[0, 'a'] = 100 # no warning, does not propagate to data
assert data.filter(lambda g: True).loc[0, 'a'] == 1

Even though an almost identical example does trigger a warning:

data = pd.DataFrame({'a': [1, 2, 2], 'b': ['a', 'b', 'c']})
data = data.groupby('a')
new_data = data.filter(lambda g: len(g)==1)
new_data.loc[0, 'a'] = 100 # warning, does not propagate to data
assert data.filter(lambda g: True).loc[0, 'a'] == 1

Update: I'm responding to the answer by @firelynx here because it's hard to put it in the comment.

In the answer, @firelynx says that the first code snippet results in no warning because I'm taking the entire dataframe. But even if I took part of it, I still don't get a warning:

# pandas 0.18.1, python 3.5.1
import pandas as pd
data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c'], 'c': range(3)})
new_data = data[['a', 'b']]
data = data['a']
new_data.loc[0, 'a'] = 100 # no warning, doesn't propagate to data

data[0] == 1
True

498

asked Sep 04 '16 22:09

max

1 Answer

Explaining what you're doing, step by step

The DataFrame you create is not a view

data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
data._is_view
False

new_data is also not a view, because you are taking all columns

new_data = data[['a', 'b']]
new_data._is_view
False

Now you are reassigning data to be the Series 'a'

data = data['a']
type(data)
pandas.core.series.Series

Which is a view

data._is_view
True

Now you update a value in the non-copy new_data

new_data.loc[0, 'a'] = 100 # no warning, doesn't propagate to data

This should not give a warning. It is the whole dataframe.

The Series you've created flags itself as a view, but it's not a DataFrame and does not behave as a DataFrame view.

Avoiding writing code like this

The Series vs. DataFrame problem is a very common one in pandas [citation not needed if you've worked with pandas for a while].

The problem is really that you should always be writing

data[['a']] not data['a']

The left expression creates a DataFrame; the right one creates a Series.
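To make the distinction concrete, here is a minimal sketch using the same two-column frame as the question. The names as_frame and as_series are only illustrative and do not appear in the original code.

```python
import pandas as pd

data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})

# Hypothetical names, chosen only for this illustration.
as_frame = data[['a']]   # list of labels -> DataFrame with one column
as_series = data['a']    # single label   -> Series

print(type(as_frame).__name__, as_frame.shape)    # DataFrame (3, 1)
print(type(as_series).__name__, as_series.shape)  # Series (3,)
```

The two results differ in type and dimensionality even though the expressions differ by only one pair of brackets.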

Some people argue that you should never write data['a'] and should use data.a instead, so that you can add warnings to your environment for any data['a'] code.

This does not work. First of all, the data.a syntax causes cognitive dissonance.

A DataFrame is a collection of columns. In Python we access members of collections with the [] operator and attributes with the . operator. Switching these around causes cognitive dissonance for anyone who is a Python programmer, especially when you start doing things like del data.a and notice that it does not work. See this answer for a more extensive explanation.
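The following sketch illustrates two ways attribute access can surprise you. The column name 'count' is an assumption picked for the example because it collides with an existing DataFrame method; it is not from the original question.

```python
import pandas as pd

# 'count' is a hypothetical column name that collides with DataFrame.count.
data = pd.DataFrame({'a': [1, 2, 3], 'count': [10, 20, 30]})

# Item access always returns the column.
count_column = data['count']

# Attribute access finds the DataFrame.count method first, not the column.
count_attribute = data.count

# Deleting through attribute syntax fails; the column stays in place.
try:
    del data.a
    attribute_delete_failed = False
except AttributeError:
    attribute_delete_failed = True

# Deleting through item syntax works.
del data['count']
remaining_columns = list(data.columns)
```

After this runs, column 'a' is still present because del data.a raised an AttributeError, while del data['count'] removed its column as expected.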

Clean code to the rescue

It is hard to see the difference between data[['a']] and data['a']

This is a smell. We should be doing neither.

The proper way, following clean code principles and the Zen of Python's "Explicit is better than implicit", is this:

columns = ['a']
data[columns]

This may not seem so mind-boggling, but take a look at the following example:

data[['ad', 'cpc', 'roi']]

What does this mean? What are these columns? What data are you getting here?

These are the first questions to arrive in anyone's head when reading this line of code.

How do we solve it? Not with a comment.

ad_performance_columns = ['ad', 'cpc', 'roi']
data[ad_performance_columns]

More explicit is always better.
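The same principle can be extended to the view-versus-copy question itself. As a minimal sketch, and not something the answer above prescribes, the named column list can be combined with an explicit .copy() so that the independence of the new object is stated in the code. The sample values and the extra 'clicks' column are made up for illustration.

```python
import pandas as pd

# Made-up sample data; only the column names 'ad', 'cpc', 'roi' come from the text.
data = pd.DataFrame({
    'ad': ['x', 'y', 'z'],
    'cpc': [0.5, 0.7, 0.2],
    'roi': [1.2, 0.9, 1.5],
    'clicks': [10, 20, 30],
})

ad_performance_columns = ['ad', 'cpc', 'roi']

# .copy() states the intent: ad_performance is an independent object.
ad_performance = data[ad_performance_columns].copy()
ad_performance.loc[0, 'cpc'] = 9.9

# The original frame is untouched by the assignment above.
original_cpc = data.loc[0, 'cpc']
```

With the explicit copy, a reader does not need to know pandas' internal rules to predict that data keeps its original value.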

For more, please consider buying a book on clean code. Maybe this one

177

answered Oct 10 '22 00:10

firelynx

Tags: python, pandas, chained-assignment
