Is there a way to ignore duplicate column names when using pandas' to_excel() function?
Say I have old_wb.xlsx:
>> df1 = pd.read_excel('old_wb.xlsx')
---------------------merged header--------------------
col1 col2 col3 col1 col4 col1 col2 col5
test test test test test test test test
and say I did some processing on my Excel file, for example removing the merged header, and then saved it to another Excel file:
>> df1.to_excel('new_wb.xlsx', 'Sheet1', merge_cells=False, header=None, index=False)
The column names of new_wb.xlsx look like this:
col1 col2 col3 col1.1 col4 col1.2 col2.1 col5
test test test test test test test test
It appended .1 to the duplicate column names, and the suffix keeps incrementing for each further duplicate.
I tried renaming the columns before calling to_excel(), but it didn't work. It seems the renaming of duplicates happens in to_excel():
>> df1.rename(columns=lambda x: x.replace('.1',''))
Upon searching, I found the mangle_dupe_cols=False argument, but unfortunately it returned:
ValueError: Setting mangle_dupe_cols=False is not supported yet
Any help on how to ignore duplicate column names when saving with to_excel()?
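Note: the .1/.2 suffixes are added when the file is read, not when it is written, so one way around it is to stop pandas from treating the label row as column names in the first place. A minimal sketch, assuming the layout shown above (merged header in the first row, the real labels in the second) and reusing the example file names; in recent pandas versions the duplicate labels should then be written back out as-is, though behaviour may vary:

import pandas as pd

# Read with no header so the duplicate labels are never mangled,
# then promote the row that holds the real labels to the columns.
raw = pd.read_excel('old_wb.xlsx', header=None)   # row 0: merged header, row 1: labels

df1 = raw.iloc[2:].reset_index(drop=True)         # keep only the data rows
df1.columns = list(raw.iloc[1])                   # an Index may contain duplicate labels

df1.to_excel('new_wb.xlsx', sheet_name='Sheet1', index=False)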
@Ricky Aguilar has a great solution. I took his solution and just made it more dynamic.
Now you can rename all your duplicate headers without even knowing what their values are:
import pandas

def dataframe_allowing_duplicate_headers():
    # To Hold All The Possible Duplicate Tags ['.1', '.2', '.3', ...]
    dup_id_range = []
    # Load Your Excel File Using Pandas
    dataframe = pandas.read_excel("path_to_excel_file", sheet_name="sheetname")
    # Generate And Store All The Possible Duplicate Tags ['.1', '.2', '.3', ...]
    for count in range(1, len(dataframe.columns)):
        dup_id_range.append('.{}'.format(count))

    # Search And Replace All Duplicate Headers To What It Was Set As Originally
    def rename(dataframe, character_number):
        duplicate_columns = list(
            filter(lambda v: isinstance(v, str)
                   and v[len(v) - character_number:] in dup_id_range,
                   dataframe.columns))
        for duplicate_column in duplicate_columns:
            dataframe = dataframe.rename(
                columns={duplicate_column: duplicate_column[:-character_number]})
        return dataframe

    # Replace The Possible Duplicates Respectfully Based On Columns Count
    if len(dup_id_range) > 0:
        dataframe = rename(dataframe, 2)
    if len(dup_id_range) > 9:
        dataframe = rename(dataframe, 3)
    if len(dup_id_range) > 99:
        dataframe = rename(dataframe, 4)
    # If You Have More Than A Thousand Columns (lol)
    # if len(dup_id_range) > 999:
    #     dataframe = rename(dataframe, 5)

    return dataframe
USAGE:
# This Dataframe Will Have All Your Headers, Allowing Your Duplicates
my_dataframe = dataframe_allowing_duplicate_headers()
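If the end goal is the round trip from the question, the restored DataFrame can then be written back to Excel; a short hedged follow-up (the output file and sheet names here are just placeholders):

# Write the restored (duplicate) headers back out to a new file
my_dataframe.to_excel('new_wb.xlsx', sheet_name='Sheet1', index=False)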
It's workable using:
df1.rename(columns={'old_name':'new_name'})
Although it doesn't look good, as I have 10 columns to rename.
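If the only change needed is stripping the .N suffixes that the read step appends, a regex over df1.columns avoids spelling out all ten renames. A minimal sketch, assuming none of the real column names ends in a dot followed by digits:

import re

# Strip a trailing ".<digits>" from every column label in place
df1.columns = [re.sub(r'\.\d+$', '', str(col)) for col in df1.columns]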