
Fast method for removing duplicate columns in pandas.DataFrame

Tags:

python

pandas

By using

df_ab = pd.concat([df_a, df_b], axis=1, join='inner')

I get a DataFrame that looks like this:

    A    A    B    B
0   5    5   10   10
1   6    6   19   19

and I want to remove the duplicate columns:

    A     B
0   5    10
1   6    19

Because df_a and df_b are subsets of the same DataFrame, I know that columns with the same name hold the same values in every row. I have a working solution:

df_ab = df_ab.T.drop_duplicates().T

but I have a large number of rows, so this is very slow. Does someone have a faster solution? I would prefer a solution where explicit knowledge of the column names isn't needed.

Peter Klauke asked Aug 17 '15 00:08


2 Answers

Perhaps you would be better off avoiding the problem altogether, by using pd.merge instead of pd.concat:

df_ab = pd.merge(df_a, df_b, how='inner')

This merges df_a and df_b on every column the two frames have in common, so each shared column appears only once in the result.
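A minimal sketch of the merge approach, assuming df_a and df_b are column subsets of one source frame as described in the question. The source frame `df` and the extra column `C` are hypothetical and only there to show that non-shared columns are kept. Note that pd.merge joins on column values rather than on the index, so if the shared columns contained repeated rows, the result could have more rows than the input.

```python
import pandas as pd

# Hypothetical source frame; A and B use the values from the question,
# C is an invented extra column for illustration.
df = pd.DataFrame({"A": [5, 6], "B": [10, 19], "C": [1, 2]})
df_a = df[["A", "B"]]
df_b = df[["A", "B", "C"]]

# With no `on` argument, merge joins on the columns both frames share
# (here A and B), so each shared column appears once in the result.
df_ab = pd.merge(df_a, df_b, how="inner")
print(df_ab)
```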

unutbu answered Oct 04 '22 14:10


The easiest way is:

df = df.loc[:,~df.columns.duplicated()]

One line of code can change everything
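As a small sketch of how that line behaves, using the values from the question: `columns.duplicated()` flags every repeat of a column label after its first occurrence, and the inverted mask keeps only the first of each. It compares labels only, not values, which fits the question's premise that same-named columns are identical; it also avoids transposing the frame.

```python
import pandas as pd

# Rebuild the situation from the question: two frames with the same columns.
df_a = pd.DataFrame({"A": [5, 6], "B": [10, 19]})
df_b = df_a.copy()
df_ab = pd.concat([df_a, df_b], axis=1, join="inner")

# True for each column label that has already appeared further left.
mask = df_ab.columns.duplicated()

# Keep only the first occurrence of each column label.
deduped = df_ab.loc[:, ~mask]
print(deduped)
```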

Prayson W. Daniel answered Oct 04 '22 15:10