
Pandas merge unexpectedly produces suffixes

I am merging two Pandas DataFrames and am getting "_x" and "_y" suffixes. An easy-to-replicate example is below. I tried adding suffixes=(False, False) to the merge call, but it raises an error: ValueError: columns overlap but no suffix specified: Index(['f1', 'f2', 'f3'], dtype='object'). I must be missing something obvious here. I understand why this would occur using join, but I didn't expect it for merge.

Please ignore the copy slice error. I can't figure out why it doesn't throw this error on Line 10, but does throw it on Line 17. (If you know, there's an open question here on it!)

System details: Windows 10
conda 4.8.2
Python 3.8.3
pandas 1.0.5 py38he6e81aa_0 conda-forge

import pandas as pd

#### Build an example DataFrame for easy-to-replicate example ####
myid = [1, 1, 1, 2, 2]
myorder = [3, 2, 1, 2, 1]
y = [3642, 3640, 3632, 3628, 3608]
x = [11811, 11812, 11807, 11795, 11795]
df = pd.DataFrame(list(zip(myid, myorder, x, y)),
                  columns=['myid', 'myorder', 'x', 'y'])
df.sort_values(by=['myid', 'myorder'], inplace=True) #Line10
df.reset_index(drop=True, inplace=True)
display(df.style.hide_index())

### Typical analysis on existing DataFrame, Error occurs in here ####
for id in df.myid.unique():
    tempdf = df[df.myid == id]
    tempdf.sort_values(by=['myid', 'myorder'], inplace=True) #Line17
    tempdf.reset_index(drop=True, inplace=True)
    for i, r in tempdf.iloc[1:].iterrows():
        ## in reality, calling a more complicated function here
        ## this is just a simple example
        tempdf.loc[i, 'f1'] = tempdf.x[i-1] - tempdf.x[i]
        tempdf.loc[i, 'f2'] = tempdf.y[i-1] - tempdf.y[i]
        tempdf.loc[i, 'f3'] = tempdf.y[i] +2
   
    what_i_care_about = ['myid', 'myorder', 'f1', 'f2', 'f3']

    df = pd.merge(df, tempdf[what_i_care_about], 
                  on=['myid', 'myorder'], how='outer')
    del tempdf

display(df.style.hide_index())


asked Jul 07 '20 15:07 by a11




1 Answer

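Your problem is that there are columns you are not merging on that are common to both source DataFrames. Here, after the first pass through the loop df already contains f1, f2 and f3, so on the next pass both df and tempdf carry those columns. Pandas needs a way to say which one came from where, so it adds suffixes, the defaults being '_x' for the left frame and '_y' for the right.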
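A minimal sketch of that behaviour, stripped down to a single clashing column. The frames `left` and `right` and their values are hypothetical, chosen only for illustration:

```python
import pandas as pd

# Hypothetical frames: both carry a non-key column named 'f1'
left = pd.DataFrame({"myid": [1, 2], "f1": [10.0, None]})
right = pd.DataFrame({"myid": [2], "f1": [20.0]})

# 'f1' is not a merge key, so pandas keeps both copies
# and tells them apart with the default suffixes
merged = pd.merge(left, right, on="myid", how="outer")
print(list(merged.columns))  # ['myid', 'f1_x', 'f1_y']
```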

If you have a preference for which source DataFrame to keep the columns from, you can set the suffixes and filter accordingly. For example, to keep the clashing columns from the left:

# Label the two sides, with no suffix on the side you want to keep
df = pd.merge(
    df, 
    tempdf[what_i_care_about], 
    on=['myid', 'myorder'], 
    how='outer',
    suffixes=('', '_delme')  # Left gets no suffix, right gets something identifiable
)
# Discard the columns that acquired a suffix
df = df[[c for c in df.columns if not c.endswith('_delme')]]

Alternatively, you can drop one of each pair of clashing columns before merging, so that Pandas has no need to assign a suffix.
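One way the drop-before-merge idea might look, as a sketch only. The frames, the `keys` list and the `clashing` helper list are hypothetical names introduced for this example, and here the left-hand copies are the ones discarded:

```python
import pandas as pd

# Hypothetical frames sharing the non-key column 'f1'
left = pd.DataFrame({"myid": [1, 2], "myorder": [1, 1], "f1": [10.0, None]})
right = pd.DataFrame({"myid": [2], "myorder": [1], "f1": [20.0]})
keys = ["myid", "myorder"]

# Non-key columns that appear on both sides
clashing = [c for c in right.columns if c in left.columns and c not in keys]

# Drop the left-hand copies so nothing overlaps when merging
merged = pd.merge(left.drop(columns=clashing), right, on=keys, how="outer")
print(list(merged.columns))  # ['myid', 'myorder', 'f1']
```

Note that whatever was stored in the dropped columns is lost (the 10.0 for myid 1 above), so this only suits cases where one side's copy is genuinely redundant.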

answered Oct 21 '22 07:10 by Chris Cooper