Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Does Python Pandas merge (and join) fail to rename columns by suffix argument?

I am sequentially merging some data frames by a key column, but I found that the columns aren't being renamed as expected with the suffix argument. I'm using Python 3.5 and Panadas 0.19.2.

import pandas as pd
df0 = pd.DataFrame({'serial_num': ['01', '02', '03'], 'vals': ['A2', 'A3', 'A6']})
df1 = pd.DataFrame({'serial_num': ['01', '02', '03'], 'vals': ['B2', 'B3', 'B6']})
df2 = pd.DataFrame({'serial_num': ['01', '02', '03'], 'vals': ['C2', 'C3', 'C6']})

df_01 = pd.merge(df0, df1, on="serial_num", suffixes=("_df0","_df1"), copy=True)
print(df_01.columns)
df_012 = pd.merge(df_01, df2, on="serial_num", suffixes=("_foobar","_df2"), copy=True)
print(df_012.columns)

The result is:

Index(['serial_num', 'vals_df0', 'vals_df1'], dtype='object')
Index(['serial_num', 'vals_df0', 'vals_df1', 'vals'], dtype='object')

This is odd because the Pandas documentation on merge doesn't suggest, that this should be the result. The expected result of print(df_012.columns) is Index(['serial_num', 'vals_df0_foobar', 'vals_df1_foobar', 'vals_df2'], dtype='object'). It does look like the data results are correct. I tried several things including copy = True, copy.deepcopy(merged_df), and copy.deepcopy(merged_df). Using join (df_01 = df0.join(df1, on="serial_num", lsuffix="_df0", rsuffix="_df1") gives similar results. And it gets stranger, sequential merges eventually do rename the columns:

df3 = pd.DataFrame({'serial_num': ['01', '02', '03'], 'vals': ['D2', 'D3', 'D6']})
df_0123 = pd.merge(df_012, df3, on="serial_num", suffixes=("","_df3"), copy=True)
print(df_0123.columns)
df_01230 = pd.merge(df_0123, df1, on="serial_num", suffixes=("_foo","_bar"), copy=True)
print(df_01230.columns)

Returns:

Index(['serial_num', 'vals_df0', 'vals_df1', 'vals', 'vals_df3'], dtype='object')
Index(['serial_num', 'vals_df0', 'vals_df1', 'vals_foo', 'vals_df3',
   'vals_bar'],
  dtype='object')

If you followed all of that are you as surprised as me that merge suddenly renamed vals to vals_foo? Is Pandas failing to rename columns by the suffix argument in every other call?

like image 859
Alex Avatar asked Dec 24 '22 17:12

Alex


2 Answers

The suffixes argument is used only to distinguish names that are shared but which we're not merging on. As the documentation explains:

suffixes : 2-length sequence (tuple, list, ...) Suffix to apply to overlapping column names in the left and right side, respectively

The key word here is overlapping.

When you merge df0 and df1 on serial_num, each frame has a vals, which become vals_df0 and vals_df1 because of the suffixes.

Then you merge a frame with columns ['serial_num', 'vals_df0', 'vals_df1'] with one which has ['serial_num', 'vals']. vals doesn't exist in both, so there's no need to use any suffix to distinguish it, and we get ['serial_num', 'vals_df0', 'vals_df1', 'vals'].

If you want a suffix to be applied to every non-shared column, you'll have to do it manually-- that's not what the suffixes argument is for.

like image 103
DSM Avatar answered Dec 26 '22 20:12

DSM


Suffixes keyword is only used in case the merged dataframe has columns with same name, it ensures that each column has unique name. In your second merge, you are merging df01 with columns 'serial_num', 'vals_df0', 'vals_df1' and df2 with columns 'serial_num', 'val'. You don't have any duplicate column names except 'serial_no' on which you are merging so you end up getting 'serial_num', 'vals_df0', 'vals_df1', 'vals'

like image 44
Vaishali Avatar answered Dec 26 '22 19:12

Vaishali