I have a dictionary of DataFrames with the key referring to the year of the data. I would like to iterate through the dict and make modifications to the DataFrames. I make modifications to both the column names and the contents of the dfs.
for year, df in df_data.items():
    cols = df.columns
    new_cols = [re.sub(r'\s\d{4}\-\d{2}', '', c) for c in cols]
    df.columns = new_cols
for year, df in df_data.items():
    df['Date'] = pd.to_datetime(df['Date'], infer_datetime_format=True)
    df = df.drop_duplicates(subset='Id', keep='first')
Can someone explain this behavior to me? In particular, how are the dfs stored in memory, and why does renaming the columns work while the modification to the contents does not? Also, what is the best way to do this: copying the df and then replacing it in the dict, or always making the changes through the df_data[year] reference?
As @juanpa.arrivillaga describes above, drop_duplicates returns a dataframe, which you're assigning to the local variable df. Consider the following example:
a = [0, 1]
for b in a:
    print(f'b: {b}')
    b = 2
    print(f'b: {b}')
print(f'a: {a}')
This is the output:
b: 0
b: 2
b: 1
b: 2
a: [0, 1]
You can see that the local variable b is assigned the value 2, but the list a is unchanged after the loop completes. This is because b is a reference to the object that the current list item refers to, not to the list slot itself. Assigning b = 2 makes b refer to the integer 2, but it does not make the list item refer to the integer 2.
At the start of the first iteration, the references look like this:
a[0] -> the integer 0
b -> the integer 0
Assigning b = 2 results in this:
a[0] -> the integer 0
b -> the integer 2
Not this:
a[0] -> the integer 2
b -> the integer 2
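The same distinction can be seen with a dict, which is closer to the situation in the question. The following is only a minimal sketch: the dict of plain lists is a hypothetical stand-in for the dict of DataFrames, and the names data, values and snapshot are made up for illustration. It shows that an in-place change made through the loop variable is visible in the dict, while rebinding the loop variable is not.

```python
# Hypothetical stand-in for the dict of DataFrames: a dict of plain lists.
data = {2019: [3, 1, 3], 2020: [2, 2, 5]}

for year, values in data.items():
    # In-place mutation: changes the object that the dict also refers to.
    values.append(0)
    # Rebinding: 'values' now names a brand-new list; the dict is untouched.
    values = sorted(set(values))

# Copy of the dict contents after the first loop, kept for comparison.
snapshot = {year: list(values) for year, values in data.items()}
print(snapshot)  # {2019: [3, 1, 3, 0], 2020: [2, 2, 5, 0]}

for year in data:
    # Assigning through the key replaces the object stored in the dict.
    data[year] = sorted(set(data[year]))

print(data)  # {2019: [0, 1, 3], 2020: [0, 2, 5]}
```

The append survives because it changes the shared list object; the sorted result only survives once it is assigned back through data[year].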
To change the object stored in the dict from inside a loop, you must either use only methods that work in place, or assign the result back through the dict key:
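import re
import pandas as pd

for year in df_data:
    cols = df_data[year].columns
    new_cols = [re.sub(r'\s\d{4}\-\d{2}', '', c) for c in cols]
    df_data[year].columns = new_cols
for year in df_data:
    # infer_datetime_format is deprecated as of pandas 2.0 and can be omitted there.
    df_data[year]['Date'] = pd.to_datetime(df_data[year]['Date'], infer_datetime_format=True)
    # Assign through the key so the dict holds the de-duplicated DataFrame.
    df_data[year] = df_data[year].drop_duplicates(subset='Id', keep='first')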
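On the question of copying versus editing through the key, one possible alternative is to put the cleaning steps in a function that returns a new DataFrame and rebuild the dict from it. This is a sketch under assumptions, not the only reasonable approach: the sample data, the column name "Sales 2019-01" and the helper name clean are all hypothetical, and infer_datetime_format is left out because the dates here are plain ISO strings.

```python
import re

import pandas as pd

# Hypothetical sample data standing in for the real yearly DataFrames.
df_data = {
    2019: pd.DataFrame({
        "Id": [1, 1, 2],
        "Date": ["2019-01-05", "2019-01-05", "2019-02-10"],
        "Sales 2019-01": [10, 10, 20],
    }),
}


def clean(df):
    """Return a cleaned copy; the caller decides where to store it."""
    out = df.copy()
    # Strip a trailing ' YYYY-MM' suffix from each column name.
    out.columns = [re.sub(r"\s\d{4}-\d{2}", "", c) for c in out.columns]
    out["Date"] = pd.to_datetime(out["Date"])
    return out.drop_duplicates(subset="Id", keep="first")


# Build a new dict so every cleaned frame is stored under its year.
df_data = {year: clean(df) for year, df in df_data.items()}

print(df_data[2019].columns.tolist())  # ['Id', 'Date', 'Sales']
print(len(df_data[2019]))              # 2
```

Because clean works on a copy and returns its result, it no longer matters which pandas calls modify in place and which return new objects; the cost is an extra copy of each DataFrame.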