Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

pandas to_excel() ignore/allow duplicate column names

Is there way to ignore duplicate column names after using to_excel() function of pandas?

Say, I have old_wb.xlsx:

>> df1 = pd.read_excel('wb1.xlsx')
        ---------------------merged header--------------------
        col1    col2   col3   col1   col4   col1   col2   col5
        test    test   test   test   test   test   test   test

and say I did some processing to my Excel file, for example, remove the merged header and save it to another Excel file:

>> df1.to_excel('new_wb.xlsx', 'Sheet1', merged_cells=False, header=None, index=False)

The column names of new_wb.xlsx looks like this:

        col1    col2   col3   col1.1   col4   col1.2   col2.1   col5
        test    test   test   test     test   test     test     test

It added .1 to the duplicate column names and it also increments as the duplicate column name goes.

I tried renaming the column name before using to_excel() but it didn't work. It seems the renaming of duplicates happens in to_excel().

>> df1.rename(columns=lambda x: x.replace('.1',''))

Upon searching, I found an argument for to_excel() which is mangle_dupe_cols=False, unfortunately it returned:

ValueError: Setting mangle_dupe_cols=False is not supported yet

Any help on how to ignore duplicate column names in saving `to_excel()'

like image 723
Ricky Aguilar Avatar asked May 15 '18 15:05

Ricky Aguilar


People also ask

Does pandas allow us to have duplicate column names?

Index objects are not required to be unique; you can have duplicate row or column labels.

How do I remove duplicate columns in pandas?

To drop duplicate columns from pandas DataFrame use df. T. drop_duplicates(). T , this removes all columns that have the same data regardless of column names.

How avoid duplicates in pandas merge?

In this approach to prevent duplicated columns from joining the two data frames, the user needs simply needs to use the pd. merge() function and pass its parameters as they join it using the inner join and the column names that are to be joined on from left and right data frames in python.

How can check duplicate column in pandas?

To find duplicate columns we need to iterate through all columns of a DataFrame and for each and every column it will search if any other column exists in DataFrame with the same contents already. If yes then that column name will be stored in the duplicate column set.


2 Answers

@Ricky Aguilar has a great solution. I took his solution and just made it more dynamic.

Now you can rename all your duplicate headers, without even knowing what their values are

def dataframe_allowing_duplicate_headers():
    # To Hold All The Possible Duplicate Tags ['.1', '.2', '.3', ...]
    dup_id_range = []

    # Load Your Excel File Using Pandas
    dataframe = pandas.read_excel("path_to_excel_file", sheet_name="sheetname")

    # Generate And Store All The Possible Duplicate Tags ['.1', '.2', '.3', ...]
    for count in range(0, len(dataframe.columns)):
        dup_id_range.append( '.{}'.format(count) )

    # Search And Replace All Duplicate Headers To What It Was Set As Originally
    def rename(dataframe, character_number):
        duplicate_columns_chars = list(
            filter(lambda v: v[(len(v)-character_number):] in dup_id_range,
            dataframe.columns))

        for duplicate_column in duplicate_columns_chars:
            dataframe = dataframe.rename(
                columns={duplicate_column:duplicate_column[:-character_number]})
        return dataframe


    # Replace The Possible Duplicates Respectfully Based On Columns Count
    if len(dup_id_range) > 0:
        dataframe = rename(dataframe, 2)
        if len(dup_id_range) > 9:
            dataframe = rename(dataframe, 3)
            if len(dup_id_range) > 99:
                dataframe = rename(dataframe, 4)
                # If You Have More Than A Thousand Columns (lol)
                #if len(dup_id_range) > 999:
                #    dataframe = rename(dataframe, 5)

    return dataframe

USAGE:

# This Dataframe Will Have All Your Headers, Allowing Your Duplicates
my_dataframe = dataframe_allowing_duplicate_headers()
like image 51
PhillipJacobs Avatar answered Oct 24 '22 07:10

PhillipJacobs


It's workable using:

df1.rename(columns={'old_name':'new_name'})

Although, it doesn't look good as I have 10 columns to rename.

like image 27
Ricky Aguilar Avatar answered Oct 24 '22 08:10

Ricky Aguilar