Pandas Merge - How to avoid duplicating columns

People also ask

How do I get rid of duplicate columns while merging Pandas?

Use the merge() function to join the two data frames with an inner join, and pass a suffix such as '_remove' for the columns that have the same name in both data frames. Then use the drop() function to remove the columns that carry the '_remove' suffix. This ensures that identically named columns don't appear twice in the new dataframe.

How do I merge Pandas without duplicates?

To concatenate DataFrames, use the concat() method, but to ignore duplicates, use the drop_duplicates() method.

Is Pandas merge efficient?

Merge can be used in cases where neither the left nor the right columns are unique, and therefore cannot serve as an index. A merge is also just as efficient as a join as long as the merging is done on indexes where possible.

Does pd.concat remove duplicates?

No. By default, when you concatenate two dataframes that contain duplicate records, Pandas stacks them together without removing the duplicate rows.
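
As a rough illustration of the concat() and drop_duplicates() behaviour described above, here is a minimal sketch. The frames `first` and `second` and their contents are made up for the example.

```python
import pandas as pd

# Hypothetical example frames; the row (2, "b") appears in both.
first = pd.DataFrame({"id": [1, 2], "label": ["a", "b"]})
second = pd.DataFrame({"id": [2, 3], "label": ["b", "c"]})

# concat() stacks the rows and keeps the repeated one.
stacked = pd.concat([first, second], ignore_index=True)

# drop_duplicates() removes rows that are identical across all columns.
deduped = stacked.drop_duplicates().reset_index(drop=True)

print(len(stacked), len(deduped))  # 4 3
```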

You can work out the columns that are only in one DataFrame and use this to select a subset of columns in the merge.

cols_to_use = df2.columns.difference(df.columns)

Then perform the merge (note that cols_to_use is an Index object, but it has a handy tolist() method if you need a plain list).

dfNew = pd.merge(df, df2[cols_to_use], left_index=True, right_index=True, how='outer')

This will avoid any columns clashing in the merge.
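
Here is a small worked sketch of that approach, using made-up frames. The second half is an assumption about a common variation: when merging on a key column instead of the index, `difference()` would also drop the key from `df2`, so it has to be added back. The names `left`, `right`, `keep` and `key` are hypothetical.

```python
import pandas as pd

# Hypothetical frames that share an index and the column "b".
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])
df2 = pd.DataFrame({"b": [30, 40], "c": [5, 6]}, index=["x", "y"])

# Columns that exist only in df2 (returned as an Index).
cols_to_use = df2.columns.difference(df.columns)

df_new = pd.merge(df, df2[cols_to_use],
                  left_index=True, right_index=True, how="outer")
print(df_new.columns.tolist())  # ['a', 'b', 'c']

# Variation: merging on a key column. The key is in both frames,
# so difference() removes it and it must be re-added explicitly.
left = df.rename_axis("key").reset_index()
right = df2.rename_axis("key").reset_index()
keep = right.columns.difference(left.columns).tolist() + ["key"]
by_key = left.merge(right[keep], on="key", how="outer")
print(by_key.columns.tolist())  # ['key', 'a', 'b', 'c']
```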

I use the suffixes option in .merge():

dfNew = df.merge(df2, left_index=True, right_index=True,
                 how='outer', suffixes=('', '_y'))
dfNew.drop(dfNew.filter(regex='_y$').columns.tolist(), axis=1, inplace=True)

Thanks @ijoseph

Building on @rprog's answer, you can combine the suffix and filter steps into one line using a regex with a negative lookahead:

dfNew = df.merge(df2, left_index=True, right_index=True,
             how='outer', suffixes=('', '_DROP')).filter(regex='^(?!.*_DROP)')

Or using df.join:

dfNew = df.join(df2, lsuffix="DROP").filter(regex="^(?!.*DROP)")

The regex here keeps every column whose name does not contain the suffix ("_DROP" in the merge version, "DROP" in the join version) anywhere in it, so make sure to pick a suffix that doesn't already appear in any column name.
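
To see what the two one-liners keep, here is a minimal sketch with made-up frames. One thing it shows is that the two versions are not equivalent: the merge version suffixes the right-hand duplicate, so the values from `df` survive, while the join version uses `lsuffix`, so the left-hand copy is the one that gets dropped and the values from `df2` survive.

```python
import pandas as pd

# Hypothetical frames that both contain a column "b".
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"b": [30, 40], "c": [5, 6]})

merged = df.merge(df2, left_index=True, right_index=True,
                  how="outer", suffixes=("", "_DROP")).filter(regex="^(?!.*_DROP)")
joined = df.join(df2, lsuffix="DROP").filter(regex="^(?!.*DROP)")

print(merged.columns.tolist(), merged["b"].tolist())  # ['a', 'b', 'c'] [3, 4]
print(joined.columns.tolist(), joined["b"].tolist())  # ['a', 'b', 'c'] [30, 40]
```

If you want the join version to keep the left-hand values as well, passing the suffix through `rsuffix` instead of `lsuffix` should do it.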

I'm new to Pandas, but I wanted to achieve the same thing: automatically avoiding column names with _x or _y and removing duplicate data. I finally did it by using this answer and this one from Stack Overflow.

sales.csv

    city;state;units
    Mendocino;CA;1
    Denver;CO;4
    Austin;TX;2

revenue.csv

    branch_id;city;revenue;state_id
    10;Austin;100;TX
    20;Austin;83;TX
    30;Austin;4;TX
    47;Austin;200;TX
    20;Denver;83;CO
    30;Springfield;4;I

merge.py

import pandas

def drop_y(df):
    # list comprehension of the cols that end with '_y'
    to_drop = [x for x in df if x.endswith('_y')]
    df.drop(to_drop, axis=1, inplace=True)


sales = pandas.read_csv('data/sales.csv', delimiter=';')
revenue = pandas.read_csv('data/revenue.csv', delimiter=';')

result = pandas.merge(sales, revenue,  how='inner', left_on=['state'], right_on=['state_id'], suffixes=('', '_y'))
drop_y(result)
result.to_csv('results/output.csv', index=True, index_label='id', sep=';')

When executing the merge command, I replace the _x suffix with an empty string, and then I can remove the columns ending with _y.

output.csv

    id;city;state;units;branch_id;revenue;state_id
    0;Denver;CO;4;20;83;CO
    1;Austin;TX;2;10;100;TX
    2;Austin;TX;2;20;83;TX
    3;Austin;TX;2;30;4;TX
    4;Austin;TX;2;47;200;TX
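
Note that the output above still carries state_id, which repeats state on every row. The sketch below reproduces the example in memory (using io.StringIO in place of the data/ files, which is an assumption made so that it runs standalone) and additionally drops the right-hand key. It is one possible extension, not part of the original script.

```python
import io
import pandas as pd

sales_csv = """city;state;units
Mendocino;CA;1
Denver;CO;4
Austin;TX;2
"""
revenue_csv = """branch_id;city;revenue;state_id
10;Austin;100;TX
20;Austin;83;TX
30;Austin;4;TX
47;Austin;200;TX
20;Denver;83;CO
30;Springfield;4;I
"""

sales = pd.read_csv(io.StringIO(sales_csv), delimiter=";")
revenue = pd.read_csv(io.StringIO(revenue_csv), delimiter=";")

result = pd.merge(sales, revenue, how="inner",
                  left_on="state", right_on="state_id", suffixes=("", "_y"))

# Drop the suffixed duplicates and the right-hand key, which repeats "state".
to_drop = [c for c in result.columns if c.endswith("_y")] + ["state_id"]
result = result.drop(columns=to_drop)

print(result.columns.tolist())
# ['city', 'state', 'units', 'branch_id', 'revenue']
```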

This works around the problem rather than avoiding it, but I have written a function that deals with the extra columns:

import pandas as pd

def merge_fix_cols(df_company, df_product, uniqueID):
    df_merged = pd.merge(df_company,
                         df_product,
                         how='left', left_on=uniqueID, right_on=uniqueID)
    # Drop the right-hand duplicates (columns suffixed with '_y')
    to_drop = [col for col in df_merged if col.endswith('_y')]
    df_merged = df_merged.drop(to_drop, axis=1)
    # Remove the '_x' suffix by slicing; rstrip('_x') would strip any
    # trailing '_' or 'x' characters and could mangle other column names
    df_merged = df_merged.rename(
        columns=lambda col: col[:-2] if col.endswith('_x') else col)
    return df_merged

Seems to work well with my merges!
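
One thing to watch when stripping an '_x' suffix from column names: str.rstrip() treats its argument as a set of characters, not as a suffix. The sketch below shows the difference. The helper `strip_suffix` and the example column names are hypothetical.

```python
import pandas as pd

# rstrip("_x") removes any trailing '_' or 'x' characters, one by one.
print("index_x".rstrip("_x"))  # inde
print("max".rstrip("_x"))      # ma

def strip_suffix(name, suffix="_x"):
    """Remove suffix from name only when name actually ends with it."""
    return name[: -len(suffix)] if name.endswith(suffix) else name

# rename() accepts a callable and applies it to every column name.
df_merged = pd.DataFrame(columns=["index_x", "max", "price"])
df_merged = df_merged.rename(columns=strip_suffix)
print(df_merged.columns.tolist())  # ['index', 'max', 'price']
```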
