Does Python Pandas merge (and join) fail to rename columns by suffix argument?

Question

I am sequentially merging some data frames by a key column, but I found that the columns aren't being renamed as expected with the suffix argument. I'm using Python 3.5 and Panadas 0.19.2.

import pandas as pd
df0 = pd.DataFrame({'serial_num': ['01', '02', '03'], 'vals': ['A2', 'A3', 'A6']})
df1 = pd.DataFrame({'serial_num': ['01', '02', '03'], 'vals': ['B2', 'B3', 'B6']})
df2 = pd.DataFrame({'serial_num': ['01', '02', '03'], 'vals': ['C2', 'C3', 'C6']})

df_01 = pd.merge(df0, df1, on="serial_num", suffixes=("_df0","_df1"), copy=True)
print(df_01.columns)
df_012 = pd.merge(df_01, df2, on="serial_num", suffixes=("_foobar","_df2"), copy=True)
print(df_012.columns)

The result is:

Index(['serial_num', 'vals_df0', 'vals_df1'], dtype='object')
Index(['serial_num', 'vals_df0', 'vals_df1', 'vals'], dtype='object')

This is odd because the Pandas documentation on merge doesn't suggest, that this should be the result. The expected result of print(df_012.columns) is Index(['serial_num', 'vals_df0_foobar', 'vals_df1_foobar', 'vals_df2'], dtype='object'). It does look like the data results are correct. I tried several things including copy = True, copy.deepcopy(merged_df), and copy.deepcopy(merged_df). Using join (df_01 = df0.join(df1, on="serial_num", lsuffix="_df0", rsuffix="_df1") gives similar results. And it gets stranger, sequential merges eventually do rename the columns:

df3 = pd.DataFrame({'serial_num': ['01', '02', '03'], 'vals': ['D2', 'D3', 'D6']})
df_0123 = pd.merge(df_012, df3, on="serial_num", suffixes=("","_df3"), copy=True)
print(df_0123.columns)
df_01230 = pd.merge(df_0123, df1, on="serial_num", suffixes=("_foo","_bar"), copy=True)
print(df_01230.columns)

Returns:

Index(['serial_num', 'vals_df0', 'vals_df1', 'vals', 'vals_df3'], dtype='object')
Index(['serial_num', 'vals_df0', 'vals_df1', 'vals_foo', 'vals_df3',
   'vals_bar'],
  dtype='object')

If you followed all of that are you as surprised as me that merge suddenly renamed vals to vals_foo? Is Pandas failing to rename columns by the suffix argument in every other call?

DSM · Accepted Answer

The suffixes argument is used only to distinguish names that are shared but which we're not merging on. As the documentation explains:

suffixes : 2-length sequence (tuple, list, ...) Suffix to apply to overlapping column names in the left and right side, respectively

The key word here is overlapping.

When you merge df0 and df1 on serial_num, each frame has a vals, which become vals_df0 and vals_df1 because of the suffixes.

Then you merge a frame with columns ['serial_num', 'vals_df0', 'vals_df1'] with one which has ['serial_num', 'vals']. vals doesn't exist in both, so there's no need to use any suffix to distinguish it, and we get ['serial_num', 'vals_df0', 'vals_df1', 'vals'].

If you want a suffix to be applied to every non-shared column, you'll have to do it manually-- that's not what the suffixes argument is for.

Vaishali · Answer

Suffixes keyword is only used in case the merged dataframe has columns with same name, it ensures that each column has unique name. In your second merge, you are merging df01 with columns 'serial_num', 'vals_df0', 'vals_df1' and df2 with columns 'serial_num', 'val'. You don't have any duplicate column names except 'serial_no' on which you are merging so you end up getting 'serial_num', 'vals_df0', 'vals_df1', 'vals'

Does Python Pandas merge (and join) fail to rename columns by suffix argument?

Tags:

python

merge

python-3.x

pandas

dataframe

Alex

2 Answers

DSM

Vaishali

Recent Activity

Donate For Us

Does Python Pandas merge (and join) fail to rename columns by suffix argument?

Tags:

python

merge

python-3.x

pandas

dataframe

Alex

2 Answers

DSM

Vaishali

Related questions

Recent Activity

Donate For Us