
What is the best practice for looping through a dictionary of pandas dataframes and making modifications?

Tags: python, pandas

I have a dictionary of DataFrames with the key referring to the year of the data. I would like to iterate through the dict and make modifications to the DataFrames. I make modifications to both the column names and the contents of the dfs.

for year, df in df_data.items():
    cols = df.columns
    new_cols = [re.sub(r'\s\d{4}\-\d{2}', '', c) for c in cols]
    df.columns = new_cols

for year, df in df_data.items():
    df['Date'] = pd.to_datetime(df['Date'], infer_datetime_format=True)
    df = df.drop_duplicates(subset='Id', keep='first')

Can someone explain this behavior to me? In particular, how are the dfs stored in memory, and why does renaming the columns work while the modification to the contents does not? Also, what is the best way to do this: copying the df and then replacing it in the dict, or always making the changes through the df_data[year] reference?

Boom asked Nov 07 '22 07:11


1 Answer

As @juanpa.arrivillaga describes above, drop_duplicates returns a new DataFrame, which you're assigning to the loop variable df. The DataFrame stored in the dict is never replaced. Consider the following example:

a = [0, 1]
for b in a:
    print(f'b: {b}')
    b = 2
    print(f'b: {b}') 

print(f'a: {a}')

This is the output:

b: 0
b: 2
b: 1
b: 2
a: [0, 1]

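You can see that the local variable b is assigned the value 2, but the list a is unchanged after the loop completes. This is because b is just a name bound to the same object as the current list element; it is not an alias for the list slot itself. Assigning b = 2 rebinds b to the integer 2, but it does not change which object a[0] refers to. At the start of the first iteration, the references look like this:

a[0] -> the integer 0
   b -> the integer 0

Assigning b = 2 results in this:

a[0] -> the integer 0
   b -> the integer 2

Not this:

a[0] -> the integer 2
   b -> the integer 2

The same thing happens in your loop. df.columns = new_cols and df['Date'] = ... modify the DataFrame object that the dict also refers to, so those changes stick. df = df.drop_duplicates(...) only rebinds the local name df to a new DataFrame and leaves df_data[year] pointing at the old one.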
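
The difference between mutating an object and rebinding a name shows up with any mutable value held in a dict. Here is a minimal sketch that uses a hypothetical dict of lists as a stand-in for the dict of DataFrames (the data and names are made up for illustration):

```python
# Hypothetical data: a dict of lists stands in for the dict of DataFrames.
data = {2020: [3, 1, 1], 2021: [2, 2, 5]}

for year, values in data.items():
    # In place: mutates the very list object the dict holds,
    # much like df.columns = ... or df['Date'] = ...
    values.append(0)
    # Rebinding: builds a new list and points the local name at it,
    # much like df = df.drop_duplicates(...). The dict is not updated.
    values = sorted(set(values))

print(data)  # {2020: [3, 1, 1, 0], 2021: [2, 2, 5, 0]}
```

The appended 0 is visible through the dict, while the de-duplicated, sorted lists are thrown away at the end of each iteration.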

To change the DataFrames stored in the dict from inside a loop, either use only operations that modify the object in place, or assign the result back into the dict under its key:

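import re
import pandas as pd

for year in df_data.keys():
    cols = df_data[year].columns
    new_cols = [re.sub(r'\s\d{4}\-\d{2}', '', c) for c in cols]
    df_data[year].columns = new_cols  # in-place attribute assignment

for year in df_data.keys():
    # infer_datetime_format is deprecated as of pandas 2.0 and can be dropped there
    df_data[year]['Date'] = pd.to_datetime(df_data[year]['Date'], infer_datetime_format=True)
    # drop_duplicates returns a new DataFrame, so store it back under the key
    df_data[year] = df_data[year].drop_duplicates(subset='Id', keep='first')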
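
On the "copy and replace" option from the question: one way to do it is to put all the clean-up in a function that returns a new DataFrame and then rebuild the dict in a single pass. This is only a sketch of that approach, not the one true way. The clean helper and the sample data are hypothetical, and the column names Id and Date are taken from the question.

```python
import re
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy; the input frame is left untouched."""
    # Strip a trailing " YYYY-MM" suffix from each column name.
    out = df.rename(columns=lambda c: re.sub(r"\s\d{4}-\d{2}", "", c))
    out["Date"] = pd.to_datetime(out["Date"])
    return out.drop_duplicates(subset="Id", keep="first")

# Hypothetical sample data, not from the original question.
df_data = {
    2021: pd.DataFrame({
        "Id": [1, 1, 2],
        "Date": ["2021-01-05", "2021-01-05", "2021-02-10"],
        "Sales 2021-01": [10, 10, 20],
    }),
}

# Rebuild the dict so every key points at its cleaned frame.
df_data = {year: clean(df) for year, df in df_data.items()}

print(df_data[2021].columns.tolist())  # ['Id', 'Date', 'Sales']
print(len(df_data[2021]))              # 2
```

Because the results are collected into a new dict, it no longer matters which pandas calls work in place and which return a new object.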

Dave answered Nov 14 '22 22:11