I've been experimenting for a while with pd.Series and pd.DataFrame and ran into a strange problem. Let's say I have the following pd.DataFrame:
df = pd.DataFrame({'col':[[1,2,3]]})
Notice that this DataFrame has a column containing a list. I want to modify a copy of this DataFrame and return the modified version, so that the original remains unchanged. For the sake of simplicity, let's say I want to append the integer 4 to the list in its cell.
I've tried the following code:
def modify(df):
    dfc = df.copy(deep=True)
    dfc['col'].iloc[0].append(4)
    return dfc
modify(df)
print(df)
The problem is that, besides the new copy dfc, the initial DataFrame df is also modified. Why? What should I do to prevent the initial DataFrame from being modified? My pandas version is 0.25.0.
From the docs here, in the Notes section:
When deep=True, data is copied but actual Python objects will not be copied recursively, only the reference to the object. This is in contrast to copy.deepcopy in the Standard Library, which recursively copies object data (see examples below).
This is referenced again in this issue on GitHub, where the devs state that:
embedding mutable objects inside a DataFrame is an antipattern
So this function is working as the devs intend - mutable objects such as lists should not be embedded in DataFrames.
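To make the note from the docs concrete, here is a minimal sketch of what deep=True does and does not copy. The extra 'num' column and the variable names are illustrative assumptions, not part of the question's code.

```python
import pandas as pd

# 'num' is a hypothetical extra column added only for contrast with the list column
df = pd.DataFrame({"num": [1], "col": [[1, 2, 3]]})
dfc = df.copy(deep=True)

# Assigning a new value in the copy does not touch the original,
# because the underlying data arrays were copied
dfc.loc[0, "num"] = 99
num_unchanged = df.loc[0, "num"] == 1

# The list stored in the cell is still one shared Python object:
# only the reference to it was copied
same_list = dfc["col"].iloc[0] is df["col"].iloc[0]

print(num_unchanged, same_list)
```

Both checks should come out true: the frame itself is independent, but any in-place mutation of the shared list (such as append) is visible through both frames.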
I couldn't find a way to get copy.deepcopy to work as intended on a DataFrame, but I did find a fairly awful workaround using pickle:
import pandas as pd
import pickle
df = pd.DataFrame({'col':[[1,2,3]]})
def modify(df):
    dfc = pickle.loads(pickle.dumps(df))
    print(dfc['col'].iloc[0] is df['col'].iloc[0])  # Check if we've succeeded in deepcopying
    dfc['col'].iloc[0].append(4)
    print(dfc)
    return dfc
modify(df)
print(df)
Output:
False
            col
0  [1, 2, 3, 4]
         col
0  [1, 2, 3]
This is an interesting question. The reason this happens is that, even though you're creating a copy of the DataFrame, the inner lists in both frames still reference the same object; put another way, the objects are not copied recursively. This can be seen by checking the list's id:
df = pd.DataFrame({'col':[[1,2]]})
dfc = df.copy()
dfc['col'].iloc[0].append(4)
id(df.iloc[0,0])
# 1734189849288
id(dfc.iloc[0,0])
# 1734189849288
A workaround I can think of is to apply list.copy along the columns of interest:
id(df.iloc[0,0])
# 1734174279432
dfc = df.copy()
dfc['col'] = df.col.apply(list.copy)
id(dfc.iloc[0,0])
# 1734186015688
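Note that list.copy is itself shallow and only exists on lists, so nested lists or other mutable cell types would still be shared. One possible generalization, sketched here under the assumption that every cell in the object columns can be handled by copy.deepcopy, is to apply copy.deepcopy element-wise. The helper name deep_copy_frame and the 'nested' column are hypothetical and not part of pandas or the original question.

```python
import copy
import pandas as pd

def deep_copy_frame(df):
    """Hypothetical helper: copy a DataFrame, then deep-copy each cell
    of its object-dtype columns so no Python objects are shared."""
    out = df.copy(deep=True)
    for name, dtype in out.dtypes.items():
        if dtype == object:
            # copy.deepcopy is applied to every cell individually
            out[name] = out[name].map(copy.deepcopy)
    return out

# 'nested' is an illustrative column holding a list of lists
df = pd.DataFrame({"col": [[1, 2, 3]], "nested": [[[1], [2]]]})
dfc = deep_copy_frame(df)

# Mutate the copy at both nesting levels
dfc["col"].iloc[0].append(4)
dfc["nested"].iloc[0][0].append(9)

print(df)
print(dfc)
```

This costs one Python-level deepcopy per cell, so it may be slow on large frames; it also assumes unique column names.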