I am sequentially merging some data frames by a key column, but I found that the columns aren't being renamed as expected with the suffix argument. I'm using Python 3.5 and Panadas 0.19.2.
import pandas as pd
df0 = pd.DataFrame({'serial_num': ['01', '02', '03'], 'vals': ['A2', 'A3', 'A6']})
df1 = pd.DataFrame({'serial_num': ['01', '02', '03'], 'vals': ['B2', 'B3', 'B6']})
df2 = pd.DataFrame({'serial_num': ['01', '02', '03'], 'vals': ['C2', 'C3', 'C6']})
df_01 = pd.merge(df0, df1, on="serial_num", suffixes=("_df0","_df1"), copy=True)
print(df_01.columns)
df_012 = pd.merge(df_01, df2, on="serial_num", suffixes=("_foobar","_df2"), copy=True)
print(df_012.columns)
The result is:
Index(['serial_num', 'vals_df0', 'vals_df1'], dtype='object')
Index(['serial_num', 'vals_df0', 'vals_df1', 'vals'], dtype='object')
This is odd because the Pandas documentation on merge doesn't suggest, that this should be the result. The expected result of print(df_012.columns)
is Index(['serial_num', 'vals_df0_foobar', 'vals_df1_foobar', 'vals_df2'], dtype='object')
. It does look like the data results are correct. I tried several things including copy = True
, copy.deepcopy(merged_df)
, and copy.deepcopy(merged_df)
. Using join (df_01 = df0.join(df1, on="serial_num", lsuffix="_df0", rsuffix="_df1"
) gives similar results. And it gets stranger, sequential merges eventually do rename the columns:
df3 = pd.DataFrame({'serial_num': ['01', '02', '03'], 'vals': ['D2', 'D3', 'D6']})
df_0123 = pd.merge(df_012, df3, on="serial_num", suffixes=("","_df3"), copy=True)
print(df_0123.columns)
df_01230 = pd.merge(df_0123, df1, on="serial_num", suffixes=("_foo","_bar"), copy=True)
print(df_01230.columns)
Returns:
Index(['serial_num', 'vals_df0', 'vals_df1', 'vals', 'vals_df3'], dtype='object')
Index(['serial_num', 'vals_df0', 'vals_df1', 'vals_foo', 'vals_df3',
'vals_bar'],
dtype='object')
If you followed all of that are you as surprised as me that merge suddenly renamed vals to vals_foo? Is Pandas failing to rename columns by the suffix argument in every other call?
The suffixes
argument is used only to distinguish names that are shared but which we're not merging on. As the documentation explains:
suffixes : 2-length sequence (tuple, list, ...) Suffix to apply to overlapping column names in the left and right side, respectively
The key word here is overlapping.
When you merge df0
and df1
on serial_num
, each frame has a vals
, which become vals_df0
and vals_df1
because of the suffixes.
Then you merge a frame with columns ['serial_num', 'vals_df0', 'vals_df1']
with one which has ['serial_num', 'vals']
. vals
doesn't exist in both, so there's no need to use any suffix to distinguish it, and we get ['serial_num', 'vals_df0', 'vals_df1', 'vals']
.
If you want a suffix to be applied to every non-shared column, you'll have to do it manually-- that's not what the suffixes
argument is for.
Suffixes keyword is only used in case the merged dataframe has columns with same name, it ensures that each column has unique name. In your second merge, you are merging df01 with columns 'serial_num', 'vals_df0', 'vals_df1' and df2 with columns 'serial_num', 'val'. You don't have any duplicate column names except 'serial_no' on which you are merging so you end up getting 'serial_num', 'vals_df0', 'vals_df1', 'vals'
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With