Unpredictable pandas slice assignment behavior with no SettingWithCopyWarning

It's well known (and understandable) that pandas' behavior is essentially unpredictable when assigning to a slice. But I'm used to being warned about it by SettingWithCopyWarning.

Why is the warning not generated in either of the following two code snippets, and what techniques could reduce the chance of writing such code unintentionally?

# pandas 0.18.1, python 3.5.1
import pandas as pd
data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
new_data = data[['a', 'b']]
data = data['a']
new_data.loc[0, 'a'] = 100 # no warning, doesn't propagate to data

data[0] == 1
True


data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
new_data = data['a']
data = data['a']
new_data.loc[0] = 100 # no warning, propagates to data

data[0] == 100
True

I thought the explanation was that pandas only produces the warning when the parent DataFrame is still reachable from the current context. (This would be a weakness of the detection algorithm, as my previous examples show.)

In the next snippet, AFAIK the original two-column DataFrame is no longer reachable, and yet pandas' warning mechanism still manages to trigger (luckily):

data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
new_data = data['a']
data = data[['a']]
new_data.loc[0] = 100 # warning, so we're safe

Edit:

While investigating this, I found another case of a missing warning:

data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
data = data.groupby('a')
new_data = data.filter(lambda g: len(g)==1)
new_data.loc[0, 'a'] = 100 # no warning, does not propagate to data
assert data.filter(lambda g: True).loc[0, 'a'] == 1

Yet an almost identical example does trigger a warning:

data = pd.DataFrame({'a': [1, 2, 2], 'b': ['a', 'b', 'c']})
data = data.groupby('a')
new_data = data.filter(lambda g: len(g)==1)
new_data.loc[0, 'a'] = 100 # warning, does not propagate to data
assert data.filter(lambda g: True).loc[0, 'a'] == 1

Update: I'm responding to the answer by @firelynx here because it's hard to put it in the comment.

In the answer, @firelynx says that the first code snippet results in no warning because I'm taking the entire dataframe. But even if I took part of it, I still don't get a warning:

# pandas 0.18.1, python 3.5.1
import pandas as pd
data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c'], 'c': range(3)})
new_data = data[['a', 'b']]
data = data['a']
new_data.loc[0, 'a'] = 100 # no warning, doesn't propagate to data

data[0] == 1
True

max asked Sep 04 '16 22:09


1 Answer

Explaining what you're doing, step by step

The DataFrame you create is not a view

data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
data._is_view
False

new_data is also not a view, because you are taking all columns

new_data = data[['a', 'b']]
new_data._is_view
False

Now you are rebinding data to the Series 'a'

data = data['a']
type(data)
pandas.core.series.Series

Which is a view

data._is_view
True

Now you update a value in new_data, which is not a view

new_data.loc[0, 'a'] = 100 # no warning, doesn't propagate to data

This should not give a warning. It is the whole DataFrame.

The Series you've created flags itself as a view, but it's not a DataFrame and does not behave like a DataFrame view.
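One way to take the guesswork out of this is to ask for an independent object explicitly with .copy() whenever that is what you want. The following is a minimal sketch, not something the steps above prescribe; it reuses the question's frame, and original_value and copied_value are illustrative names only.

```python
import pandas as pd

data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})

# An explicit copy owns its own data, so writes to it cannot reach `data`,
# whatever pandas decides internally about views for a plain data['a'].
new_data = data['a'].copy()
new_data.loc[0] = 100

original_value = data.loc[0, 'a']  # read back from the original frame
copied_value = new_data.loc[0]     # read back from the explicit copy

print(original_value, copied_value)
```

With the copy spelled out, whether a warning would have fired no longer matters: the write is guaranteed to stay in new_data.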

Avoiding writing code like this

The Series vs. DataFrame problem is a very common one in pandas [citation not needed if you've worked with pandas for a while]

The problem is really that you should always be writing

data[['a']] not data['a']

The former returns a DataFrame, the latter a Series.
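To make the difference between the two spellings concrete, here is a small sketch using the question's frame. The names as_series and as_frame are illustrative, not from the original post.

```python
import pandas as pd

data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})

as_series = data['a']    # single label: one-dimensional Series
as_frame = data[['a']]   # list of labels: two-dimensional DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The two results hold the same values, but they are different types with different shapes, which is why code written for one often misbehaves with the other.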

Some people argue that you should never write data['a'] and should use data.a instead, so that you can add warnings to your environment for any data['a'] code.

This does not work. First of all, the data.a syntax causes cognitive dissonance.

A DataFrame is a collection of columns. In Python we access members of collections with the [] operator and attributes with the . operator. Switching these around causes cognitive dissonance for any Python programmer, especially when you start doing things like del data.a and notice that it does not work. See this answer for a more extensive explanation
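The del asymmetry mentioned above can be checked directly. This is a small sketch on the question's frame; attribute_delete_worked and the two column lists are illustrative names.

```python
import pandas as pd

data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})

# Attribute-style deletion is not supported for columns.
try:
    del data.a
    attribute_delete_worked = True
except AttributeError:
    attribute_delete_worked = False

columns_after_attribute_delete = list(data.columns)

# Item-style deletion removes the column in place.
del data['a']
columns_after_item_delete = list(data.columns)

print(attribute_delete_worked)
print(columns_after_attribute_delete, columns_after_item_delete)
```

Reading a column through data.a works, but deleting through it does not, so the attribute syntax only covers part of what the [] syntax does.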

Clean code to the rescue

It is hard to see the difference between data[['a']] and data['a']

This is a smell. We should be doing neither.

The proper way, following clean code principles and the Zen of Python's "Explicit is better than implicit", is this:

columns = ['a']
data[columns]

This may not look so mind-boggling, but take a look at the following example:

data[['ad', 'cpc', 'roi']]

What does this mean? What are these columns? What data are you getting here?

These are the first questions that come to mind for anyone reading this line of code.

How do you solve it? Not with a comment.

ad_performance_columns = ['ad', 'cpc', 'roi']
data[ad_performance_columns]

More explicit is always better.

For more, please consider buying a book on clean code. Maybe this one

firelynx answered Oct 10 '22 00:10