I am merging two Pandas DataFrames and am getting "_x" and "_y" suffixes on the column names. An easy-to-replicate example is below. I tried adding suffixes=(False, False) to the merge, but it raises an error: ValueError: columns overlap but no suffix specified: Index(['f1', 'f2', 'f3'], dtype='object'). I must be missing something obvious here? I understand why this would occur using join, but I didn't expect it for merge.
Please ignore the copy slice error. I can't figure out why it doesn't throw this error on Line 10, but does throw it on Line 17. (If you know, there's an open question here on it!)
System details:
Windows 10
conda 4.8.2
Python 3.8.3
pandas 1.0.5 py38he6e81aa_0 conda-forge
import pandas as pd
#### Build an example DataFrame for easy-to-replicate example ####
myid = [1, 1, 1, 2, 2]
myorder = [3, 2, 1, 2, 1]
y = [3642, 3640, 3632, 3628, 3608]
x = [11811, 11812, 11807, 11795, 11795]
df = pd.DataFrame(list(zip(myid, myorder, x, y)),
                  columns=['myid', 'myorder', 'x', 'y'])
df.sort_values(by=['myid', 'myorder'], inplace=True) #Line10
df.reset_index(drop=True, inplace=True)
display(df.style.hide_index())
### Typical analysis on existing DataFrame, Error occurs in here ####
for id in df.myid.unique():
    tempdf = df[df.myid == id]
    tempdf.sort_values(by=['myid', 'myorder'], inplace=True) #Line17
    tempdf.reset_index(drop=True, inplace=True)
    for i, r in tempdf.iloc[1:].iterrows():
        ## in reality, calling a more complicated function here
        ## this is just a simple example
        tempdf.loc[i, 'f1'] = tempdf.x[i-1] - tempdf.x[i]
        tempdf.loc[i, 'f2'] = tempdf.y[i-1] - tempdf.y[i]
        tempdf.loc[i, 'f3'] = tempdf.y[i] + 2
    what_i_care_about = ['myid', 'myorder', 'f1', 'f2', 'f3']
    df = pd.merge(df, tempdf[what_i_care_about],
                  on=['myid', 'myorder'], how='outer')
    del tempdf
display(df.style.hide_index())
Answer. Yes. The order of the merged dataframes will affect the order of the rows and columns of the merged dataframe. With how='left' or how='inner', merge() preserves the order of the left keys; with how='outer', the keys are sorted lexicographically.
We can use join and merge to combine 2 dataframes. The join method works best when we are joining dataframes on their indexes (though you can specify another column to join on for the left dataframe). The merge method is more versatile and allows us to specify columns besides the index to join on for both dataframes.
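To make the difference concrete, here is a minimal sketch. The frames and column names (left, right, key, lval, rval) are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical example frames, not taken from the question
left = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b", "d"], "rval": [10, 20, 40]})

# merge: match on a named column present in both frames (inner join by default)
merged = left.merge(right, on="key")

# join: match the left frame's "key" column against the right frame's index
# (left join by default, so every left row is kept)
joined = left.join(right.set_index("key"), on="key")

print(merged)
print(joined)
```

Note that the two calls also differ in their default join type, so the row counts differ here even though the key is the same.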
Inner joins: The most common type of join is called an inner join. An inner join combines two DataFrames based on a join key and returns a new DataFrame that contains only those rows that have matching values in both of the original DataFrames.
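As a rough illustration, the sketch below contrasts an inner merge with an outer one. The orders and names frames are hypothetical.

```python
import pandas as pd

# Hypothetical example frames
orders = pd.DataFrame({"cust": [1, 2, 3], "total": [50, 75, 20]})
names = pd.DataFrame({"cust": [2, 3, 4], "name": ["Ann", "Bob", "Cy"]})

# Inner: only customers present in both frames survive
inner = pd.merge(orders, names, on="cust", how="inner")

# Outer: every customer from either frame, with NaN where one side is missing
outer = pd.merge(orders, names, on="cust", how="outer")

print(inner)
print(outer)
```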
The merge() method combines two DataFrames and returns a new DataFrame; it does not modify either input. Use its parameters (such as how, on and suffixes) to control which rows are kept and how overlapping column names are handled.
The indicator parameter: if True, adds a column to the output DataFrame called “_merge” with information on the source of each row. The column can be given a different name by providing a string argument.
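A small sketch of the indicator parameter, using made-up frames. The column name "source" is just an illustrative choice.

```python
import pandas as pd

# Hypothetical example frames
left = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b", "d"], "rval": [10, 20, 40]})

# indicator=True adds a "_merge" column: 'both', 'left_only' or 'right_only'
flagged = pd.merge(left, right, on="key", how="outer", indicator=True)

# Passing a string uses that string as the column name instead
named = pd.merge(left, right, on="key", how="outer", indicator="source")

print(flagged)
print(named.columns.tolist())
```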
But Pandas gets a bit trickier here: the last amount column does not conflict with any existing column, so it ends up in the resulting data frame as just amount (unlike the two previous amount columns, which conflict and get renamed to amount__1 and amount__2).
The suffix is needed only when the merged dataframe would have two columns with the same name. When you merge df3, your dataframe already has the column names val_1 and val_2, so there is no overlap with val. You can handle that by renaming val to val_3 in df3 before merging.
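Here is one way that could look. This is only a sketch: df1, df2 and df3 are hypothetical stand-ins with a shared key column and a val column each, and the suffixes ('_1', '_2') are assumed from the column names mentioned above.

```python
import pandas as pd

# Hypothetical frames, each with the same non-key column name "val"
df1 = pd.DataFrame({"key": [1, 2], "val": [10, 20]})
df2 = pd.DataFrame({"key": [1, 2], "val": [30, 40]})
df3 = pd.DataFrame({"key": [1, 2], "val": [50, 60]})

# First merge: "val" clashes, so both copies get a suffix
step1 = df1.merge(df2, on="key", suffixes=("_1", "_2"))

# Second merge: "val" no longer clashes with val_1/val_2, so it stays "val"
plain = step1.merge(df3, on="key")

# Renaming before the merge gives a consistent set of names
renamed = step1.merge(df3.rename(columns={"val": "val_3"}), on="key")

print(plain.columns.tolist())
print(renamed.columns.tolist())
```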
Your problem is that there are columns you are not merging on that are common to both source DataFrames. Pandas needs a way to say which one came from where, so it adds the suffixes, the defaults being '_x' on the left and '_y' on the right.
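A stripped-down sketch of both behaviours, using two small hypothetical frames that share the non-key column f1. It also shows why suffixes=(False, False) fails: when both suffixes are empty or falsy, Pandas has no way to tell the clashing columns apart and raises instead.

```python
import pandas as pd

# Hypothetical frames that share a non-key column, "f1"
left = pd.DataFrame({"myid": [1, 2], "f1": [0.5, 1.5]})
right = pd.DataFrame({"myid": [1, 2], "f1": [9.0, 8.0]})

# Default suffixes: the clashing column appears twice, as f1_x and f1_y
default = pd.merge(left, right, on="myid")
print(default.columns.tolist())

# With no usable suffix on either side, the clash cannot be resolved
error_message = None
try:
    pd.merge(left, right, on="myid", suffixes=(False, False))
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```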
If you have a preference on which source data frame to keep the columns from, then you can set the suffixes and filter accordingly. For example, if you want to keep the clashing columns from the left:
# Label the two sides, with no suffix on the side you want to keep
df = pd.merge(
    df,
    tempdf[what_i_care_about],
    on=['myid', 'myorder'],
    how='outer',
    suffixes=('', '_delme')  # Left gets no suffix, right gets something identifiable
)
# Discard the columns that acquired a suffix
df = df[[c for c in df.columns if not c.endswith('_delme')]]
Alternatively, you can drop one of each pair of clashing columns prior to merging; Pandas then has no need to assign a suffix.
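A minimal sketch of that second approach, assuming you want to keep the left-hand copies. The frames and the helper names keys and clashing are made up for illustration.

```python
import pandas as pd

# Hypothetical frames: "f1" exists on both sides, "f2" only on the right
left = pd.DataFrame({"myid": [1, 2], "myorder": [1, 1], "f1": [0.5, 1.5]})
right = pd.DataFrame({"myid": [1, 2], "myorder": [1, 1],
                      "f1": [9.0, 8.0], "f2": [3, 4]})

keys = ["myid", "myorder"]

# Non-key columns that appear in both frames
clashing = [c for c in right.columns if c in left.columns and c not in keys]

# Drop the right-hand copies first, so the merge has nothing to rename
merged = pd.merge(left, right.drop(columns=clashing), on=keys, how="outer")

print(merged)
```

As with the suffix-and-filter version, the values in the dropped right-hand columns are discarded, so this only fits when the left-hand values are the ones you want.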