I have switched from R to pandas. I routinely get SettingWithCopyWarnings, when I do something like
df_a = pd.DataFrame({'col1': [1,2,3,4]})
# Filtering step, which may or may not return a view
df_b = df_a[df_a['col1'] > 1]
# Add a new column to df_b
df_b['new_col'] = 2 * df_b['col1']
# SettingWithCopyWarning!!
I think I understand the problem, though I'll gladly learn what I got wrong. In the given example, it is undefined whether df_b
is a view on df_a
or not. Thus, the effect of assigning to df_b
is unclear: does it affect df_a
? The problem can be solved by explicitly making a copy when filtering:
df_a = pd.DataFrame({'col1': [1,2,3,4]})
# Filtering step, definitely a copy now
df_b = df_a[df_a['col1'] > 1].copy()
# Add a new column to df_b
df_b['new_col'] = 2 * df_b['col1']
# No Warning now
I think there is something that I am missing: if we can never really be sure whether we create a view or not, what are views good for? From the pandas documentation (http://pandas-docs.github.io/pandas-docs-travis/indexing.html?highlight=view#indexing-view-versus-copy)
Outside of simple cases, it’s very hard to predict whether it [getitem] will return a view or a copy (it depends on the memory layout of the array, about which pandas makes no guarantees)
Similar warnings can be found for different indexing methods.
I find it very cumbersome and errorprone to sprinkle .copy() calls throughout my code. Am I using the wrong style for manipulating my DataFrames? Or is the performance gain so high that it justifies the apparent awkwardness?
Some operations in pandas (and numpy as well) will return views of the original data, while other copies. To put it very simply, a view is a subset of the original object ( DataFrame or Series ) linked to the original source, while a copy is an entirely new object .
Index is like an address, that's how any data point across the dataframe or series can be accessed. Rows and columns both have indexes, rows indices are called as index and for columns its general column names.
A value is trying to be set on a copy of a slice from a DataFrame. One approach that can be used to suppress SettingWithCopyWarning is to perform the chained operations into just a single loc operation. This will ensure that the assignment happens on the original DataFrame instead of a copy.
Generally, to avoid a SettingWithCopyWarning in Pandas, you should do the following: Avoid chained assignments that combine two or more indexing operations like df["z"][mask] = 0 and df. loc[mask]["z"] = 0 . Apply single assignments with just one indexing operation like df.
Great question!
The short answer is: this is a flaw in pandas that's being remedied.
You can find a longer discussion of the nature of the problem here, but the main take-away is that we're now moving to a "copy-on-write" behavior in which any time you slice, you get a new copy, and you never have to think about views. The fix will soon come through this refactoring project. I actually tried to fix it directly (see here), but it just wasn't feasible in the current architecture.
In truth, we'll keep views in the background -- they make pandas SUPER memory efficient and fast when they can be provided -- but we'll end up hiding them from users so, from the user perspective, if you slice, index, or cut a DataFrame, what you get back will effectively be a new copy.
(This is accomplished by creating views when the user is only reading data, but whenever an assignment operation is used, the view will be converted to a copy before the assignment takes place.)
Best guess is the fix will be in within a year -- in the mean time, I'm afraid some .copy()
may be necessary, sorry!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With