merge() function to join the two data frames by inner join. Now, add a suffix called 'remove' for newly joined columns that have the same name in both data frames. Use the drop() function to remove the columns with the suffix 'remove'. This will ensure that identical columns don't exist in the new dataframe.
To concatenate DataFrames, use the concat() method, but to ignore duplicates, use the drop_duplicates() method.
Merge can be used in cases where both the left and right columns are not unique, and therefore cannot be an index. A merge is also just as efficient as a join as long as: Merging is done on indexes if possible.
By default, when you concatenate two dataframes with duplicate records, Pandas automatically combine them together without removing the duplicate rows.
You can work out the columns that are only in one DataFrame and use this to select a subset of columns in the merge.
cols_to_use = df2.columns.difference(df.columns)
Then perform the merge (note this is an index object but it has a handy tolist()
method).
dfNew = merge(df, df2[cols_to_use], left_index=True, right_index=True, how='outer')
This will avoid any columns clashing in the merge.
I use the suffixes
option in .merge()
:
dfNew = df.merge(df2, left_index=True, right_index=True,
how='outer', suffixes=('', '_y'))
dfNew.drop(dfNew.filter(regex='_y$').columns.tolist(),axis=1, inplace=True)
Thanks @ijoseph
Building on @rprog's answer, you can combine the various pieces of the suffix & filter step into one line using a negative regex:
dfNew = df.merge(df2, left_index=True, right_index=True,
how='outer', suffixes=('', '_DROP')).filter(regex='^(?!.*_DROP)')
Or using df.join
:
dfNew = df.join(df2, lsuffix="DROP").filter(regex="^(?!.*DROP)")
The regex here is keeping anything that does not end with the word "DROP", so just make sure to use a suffix that doesn't appear among the columns already.
I'm freshly new with Pandas but I wanted to achieve the same thing, automatically avoiding column names with _x or _y and removing duplicate data. I finally did it by using this answer and this one from Stackoverflow
sales.csv
city;state;units Mendocino;CA;1 Denver;CO;4 Austin;TX;2
revenue.csv
branch_id;city;revenue;state_id 10;Austin;100;TX 20;Austin;83;TX 30;Austin;4;TX 47;Austin;200;TX 20;Denver;83;CO 30;Springfield;4;I
merge.py import pandas
def drop_y(df):
# list comprehension of the cols that end with '_y'
to_drop = [x for x in df if x.endswith('_y')]
df.drop(to_drop, axis=1, inplace=True)
sales = pandas.read_csv('data/sales.csv', delimiter=';')
revenue = pandas.read_csv('data/revenue.csv', delimiter=';')
result = pandas.merge(sales, revenue, how='inner', left_on=['state'], right_on=['state_id'], suffixes=('', '_y'))
drop_y(result)
result.to_csv('results/output.csv', index=True, index_label='id', sep=';')
When executing the merge command I replace the _x
suffix with an empty string and them I can remove columns ending with _y
output.csv
id;city;state;units;branch_id;revenue;state_id 0;Denver;CO;4;20;83;CO 1;Austin;TX;2;10;100;TX 2;Austin;TX;2;20;83;TX 3;Austin;TX;2;30;4;TX 4;Austin;TX;2;47;200;TX
This is a bit of going around the problem, but I have written a function that basically deals with the extra columns:
def merge_fix_cols(df_company,df_product,uniqueID):
df_merged = pd.merge(df_company,
df_product,
how='left',left_on=uniqueID,right_on=uniqueID)
for col in df_merged:
if col.endswith('_x'):
df_merged.rename(columns = lambda col:col.rstrip('_x'),inplace=True)
elif col.endswith('_y'):
to_drop = [col for col in df_merged if col.endswith('_y')]
df_merged.drop(to_drop,axis=1,inplace=True)
else:
pass
return df_merged
Seems to work well with my merges!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With