It's well known (and understandable) that pandas behavior is essentially unpredictable when assigning to a slice. But I'm used to being warned about it by SettingWithCopyWarning.
Why is the warning not generated in either of the following two code snippets, and what techniques could reduce the chance of writing such code unintentionally?
# pandas 0.18.1, python 3.5.1
import pandas as pd
data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
new_data = data[['a', 'b']]
data = data['a']
new_data.loc[0, 'a'] = 100 # no warning, doesn't propagate to data
data[0] == 1
True
data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
new_data = data['a']
data = data['a']
new_data.loc[0] = 100 # no warning, propagates to data
data[0] == 100
True
I thought the explanation was that pandas only produces the warning when the parent DataFrame is still reachable from the current context. (This would be a weakness of the detection algorithm, as my previous examples show.)
In the next snippet, AFAIK the original two-column DataFrame is no longer reachable, and yet the pandas warning mechanism manages to trigger (luckily):
data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
new_data = data['a']
data = data[['a']]
new_data.loc[0] = 100 # warning, so we're safe
Edit:
While investigating this, I found another case of a missing warning:
data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
data = data.groupby('a')
new_data = data.filter(lambda g: len(g)==1)
new_data.loc[0, 'a'] = 100 # no warning, does not propagate to data
assert data.filter(lambda g: True).loc[0, 'a'] == 1
Even though an almost identical example does trigger a warning:
data = pd.DataFrame({'a': [1, 2, 2], 'b': ['a', 'b', 'c']})
data = data.groupby('a')
new_data = data.filter(lambda g: len(g)==1)
new_data.loc[0, 'a'] = 100 # warning, does not propagate to data
assert data.filter(lambda g: True).loc[0, 'a'] == 1
Update: I'm responding to the answer by @firelynx here because it's hard to fit into a comment.
In the answer, @firelynx says that the first code snippet results in no warning because I'm taking the entire DataFrame. But even if I take only part of it, I still don't get a warning:
# pandas 0.18.1, python 3.5.1
import pandas as pd
data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c'], 'c': range(3)})
new_data = data[['a', 'b']]
data = data['a']
new_data.loc[0, 'a'] = 100 # no warning, doesn't propagate to data
data[0] == 1
True
One approach that avoids SettingWithCopyWarning is to collapse the chained operations into a single .loc call. That ensures the assignment happens on the original DataFrame instead of on a possibly temporary copy, so the warning is no longer raised.
This is what the warning is telling us: 'A value is trying to be set on a copy of a slice from a DataFrame.' When we access (get) a subset of a DataFrame, pandas may return either a view or a copy.
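As a rough sketch of that idea (the frame and values below are made up for illustration, not taken from the question), compare chained indexing with a single .loc call:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
df[df['a'] > 1]['b'] = 'q'      # chained indexing: assigns to a possible copy, typically warns
df.loc[df['a'] > 1, 'b'] = 'q'  # single .loc call: assigns to df itself, no warning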
The DataFrame you create is not a view:
data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
data._is_view
False
new_data is also not a view, because you are taking all columns
new_data = data[['a', 'b']]
new_data._is_view
False
Now you are assigning data to be the Series 'a':
data = data['a']
type(data)
pandas.core.series.Series
Which is a view
data._is_view
True
Now you update a value in the non-copy new_data
new_data.loc[0, 'a'] = 100 # no warning, doesn't propagate to data
This should not give a warning. It is the whole dataframe.
The Series you've created flags itself as a view, but it's not a DataFrame and does not behave as a DataFrame view.
The Series vs. DataFrame problem is a very common one in pandas [citation not needed if you've worked with pandas for a while].
The problem is really that you should always be writing data[['a']], not data['a']. The former gives you a DataFrame, the latter a Series.
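A quick way to see the difference, using the same example data as above (output shown as comments; this is just an illustrative check):
data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
type(data[['a']])  # pandas.core.frame.DataFrame
type(data['a'])    # pandas.core.series.Series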
Some people may argue to never ever write data['a'] but to do data.a instead. Thus you can add warnings to your environment for data['a'] code.
This does not work. First of all, using the data.a syntax causes cognitive dissonance. A DataFrame is a collection of columns. In Python we access members of collections with the [] operator, and attributes with the . operator. Switching these around causes cognitive dissonance for anyone who is a Python programmer, especially when you start doing things like del data.a and notice that it does not work (see the sketch below). See this answer for a more extensive explanation.
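For example (a small illustrative sketch; the exact error may vary between pandas versions):
data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
data.a         # attribute access reads column 'a', just like data['a']
del data['a']  # works: the column is removed
del data.b     # raises AttributeError: attribute syntax cannot delete a column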
It is hard to see the difference between data[['a']] and data['a']. This is a smell; we should be doing neither.
The proper way, using clean code principles and the Zen of Python ("Explicit is better than implicit"), is this:
columns = ['a']
data[columns]
This may not be so mind boggling, but take a look at the following example:
data[['ad', 'cpc', 'roi']]
What does this mean? What are these columns? What data are you getting here?
These are the first questions that come to mind when reading this line of code. How do we solve it? Not with a comment.
ad_performance_columns = ['ad', 'cpc', 'roi']
data[ad_performance_columns]
More explicit is always better.
For more, please consider buying a book on clean code. Maybe this one.