
pandas.DataFrame.copy(deep=True) doesn't actually create deep copy [duplicate]

I've been experimenting with pd.Series and pd.DataFrame for a while and ran into a strange problem. Let's say I have the following pd.DataFrame:

df = pd.DataFrame({'col':[[1,2,3]]})

Notice that this DataFrame has a column whose cell contains a list. I want to modify a copy of this DataFrame and return the modified version, so that the original remains unchanged. For the sake of simplicity, let's say I want to append the integer 4 to the list in that cell.

I've tried the following code:

def modify(df):
    dfc = df.copy(deep=True)
    dfc['col'].iloc[0].append(4)
    return dfc

modify(df)
print(df)

The problem is that, besides the new copy dfc, the original DataFrame df is also modified. Why? What should I do to prevent the original DataFrame from being modified? My pandas version is 0.25.0.

Илья Горшков asked Mar 03 '23 21:03



2 Answers

From the pandas documentation for DataFrame.copy, in the Notes section:

When deep=True, data is copied but actual Python objects will not be copied recursively, only the reference to the object. This is in contrast to copy.deepcopy in the Standard Library, which recursively copies object data (see examples below).

This is referenced again in this issue on GitHub, where the devs state that:

embedding mutable objects inside a DataFrame is an antipattern

So this function is working as the devs intend - mutable objects such as lists should not be embedded in DataFrames.

I couldn't find a way to get copy.deepcopy to copy the inner lists of a DataFrame either, but I did find a fairly awful workaround that round-trips the DataFrame through pickle:

import pandas as pd
import pickle

df = pd.DataFrame({'col':[[1,2,3]]})

def modify(df):
    dfc = pickle.loads(pickle.dumps(df))
    print(dfc['col'].iloc[0] is df['col'].iloc[0])  # False means the inner list is a new object
    dfc['col'].iloc[0].append(4)
    print(dfc)
    return dfc

modify(df)
print(df)

Output:

False
            col
0  [1, 2, 3, 4]
         col
0  [1, 2, 3]
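
For comparison, here is a minimal sketch that checks all three approaches side by side using object identity. It assumes, as far as I can tell from the pandas source, that copy.deepcopy on a DataFrame ends up calling copy(deep=True), which would explain why it does not help. The variable names are made up for this illustration.

```python
import copy
import pickle

import pandas as pd

df = pd.DataFrame({'col': [[1, 2, 3]]})
original = df['col'].iloc[0]  # the list object stored in the cell

# Fetch the same cell from three different kinds of copy.
via_copy = df.copy(deep=True)['col'].iloc[0]
via_deepcopy = copy.deepcopy(df)['col'].iloc[0]
via_pickle = pickle.loads(pickle.dumps(df))['col'].iloc[0]

print(via_copy is original)      # True: the list is shared with df
print(via_deepcopy is original)  # True: still the same list object
print(via_pickle is original)    # False: pickling rebuilt the list
```

Only the pickle round-trip produces a new list, because unpickling reconstructs every object from its serialized form.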
CDJB answered Apr 13 '23 00:04



This is an interesting question. It happens because, even though you're creating a copy of the DataFrame, the inner lists in both frames still reference the same objects; put another way, the objects are not copied recursively. You can see this by checking the lists' ids:

import pandas as pd

df = pd.DataFrame({'col':[[1,2]]})

dfc = df.copy()
dfc['col'].iloc[0].append(4)

id(df.iloc[0,0])
# 1734189849288

id(dfc.iloc[0,0])
# 1734189849288

A workaround I can think of is to apply list.copy along the columns of interest:

id(df.iloc[0,0])
# 1734174279432

dfc = df.copy()
dfc['col'] = df.col.apply(list.copy)

id(dfc.iloc[0,0])
# 1734186015688
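
One caveat: list.copy is itself shallow, so a list nested inside a cell would still be shared. Below is a minimal sketch of a more general variant that applies copy.deepcopy to every cell of the object-dtype columns. Note that deep_copy_frame is a hypothetical helper name, not a pandas function, and the sketch assumes the column names are unique.

```python
import copy

import pandas as pd


def deep_copy_frame(df):
    """Hypothetical helper: copy df, then deep-copy each cell of its object columns."""
    out = df.copy(deep=True)
    # Only object-dtype columns can hold mutable Python objects such as lists.
    for name in out.columns[out.dtypes == object]:
        out[name] = out[name].map(copy.deepcopy)
    return out


df = pd.DataFrame({'col': [[1, [2, 3]]], 'n': [10]})
dfc = deep_copy_frame(df)

# Mutate the nested list in the copy only.
dfc['col'].iloc[0][1].append(4)

print(df['col'].iloc[0])   # [1, [2, 3]]
print(dfc['col'].iloc[0])  # [1, [2, 3, 4]]
```

This is slower than a plain copy because it visits every cell in Python, so it is probably only worth it when the cells really do hold nested mutable objects.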
yatu answered Apr 12 '23 22:04
