
How to drop duplicate data with different column names in pandas?

I have a DataFrame in which some columns hold identical data under different names:

In[1]: df
Out[1]: 
  X1   X2  Y1   Y2
 0.0  0.0  6.0  6.0
 3.0  3.0  7.1  7.1
 7.6  7.6  1.2  1.2

I know .drop(columns=...) exists, but is there a more efficient way to drop these without having to list the column names? If not, please let me know and I'll just use .drop().

ahnnni asked Dec 14 '22 06:12



1 Answer

We can use np.unique over axis 1. Unfortunately, there's no pandas built-in function to drop duplicate columns.

df.drop_duplicates only removes duplicate rows. From its docs:

"Return DataFrame with duplicate rows removed."
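To illustrate the point about rows, here is a minimal sketch. The frame reuses the question's column names, but the repeated row is made up for the demonstration.

```python
import pandas as pd

# Illustrative data: the second and third rows are identical,
# and X2/Y2 duplicate X1/Y1 column-wise.
df = pd.DataFrame({"X1": [0.0, 3.0, 3.0], "X2": [0.0, 3.0, 3.0],
                   "Y1": [6.0, 7.1, 7.1], "Y2": [6.0, 7.1, 7.1]})

# drop_duplicates compares rows, so only the repeated row is removed.
out = df.drop_duplicates()
print(out.shape)          # rows shrink, columns do not
print(list(out.columns))  # all four columns are still present
```

The duplicate columns X2 and Y2 survive, which is why a column-wise approach is needed.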

We can create a function around np.unique to drop duplicate columns.

import numpy as np
import pandas as pd

def drop_duplicate_cols(df):
    uniq, idxs = np.unique(df, return_index=True, axis=1)
    return pd.DataFrame(uniq, index=df.index, columns=df.columns[idxs])

drop_duplicate_cols(df)
    X1   Y1
0  0.0  6.0
1  3.0  7.1
2  7.6  1.2


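NB: np.unique returns its result sorted. From the docs:

"Returns the sorted unique elements of an array."

So the columns come back ordered by their values, not in their original order. Workaround: to retain the original order, sort idxs and select the columns by those positions.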
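The workaround can be sketched as follows. This is a minimal variant, not a definitive implementation: the helper name drop_duplicate_cols_keep_order and the sample frame are illustrative assumptions, and the sketch assumes every column is numeric.

```python
import numpy as np
import pandas as pd

def drop_duplicate_cols_keep_order(df):
    # Hypothetical helper name; assumes an all-numeric DataFrame.
    # return_index gives the position of the first occurrence of each unique column.
    _, idxs = np.unique(df.to_numpy(), return_index=True, axis=1)
    # np.unique orders columns by value; sorting idxs restores the original order.
    return df.iloc[:, np.sort(idxs)]

# Made-up data: the Y columns come first but sort after the X columns by value.
df = pd.DataFrame({"Y1": [6.0, 7.1, 1.2], "Y2": [6.0, 7.1, 1.2],
                   "X1": [0.0, 3.0, 7.6], "X2": [0.0, 3.0, 7.6]})
result = drop_duplicate_cols_keep_order(df)
print(list(result.columns))  # ['Y1', 'X1']
```

Selecting with df.iloc from the original frame, rather than rebuilding it from uniq, keeps the column labels and the data aligned after idxs is sorted.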


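Using .T on a DataFrame that has multiple dtypes is going to mess with your actual dtypes: the transpose upcasts every column to object. This affects transpose-based approaches such as df.T.drop_duplicates().T.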

df = pd.DataFrame({'A': [0, 1], 'B': ['a', 'b'], 'C': [0, 1], 'D':[2.1, 3.1]})
df.dtypes
A      int64
B     object
C      int64
D    float64
dtype: object

df.T.T.dtypes
A    object
B    object
C    object
D    object
dtype: object
# To get back original `dtypes` we can use `.astype`
df.T.T.astype(df.dtypes).dtypes
A      int64
B     object
C      int64
D    float64
dtype: object
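
One way to sidestep the dtype problem, sketched here under the assumption that a transpose is acceptable for computing a mask only: use df.T.duplicated() to find the repeated columns, then select from the original frame so the kept columns are never rebuilt. The variable names is_dup and deduped are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1], 'B': ['a', 'b'], 'C': [0, 1], 'D': [2.1, 3.1]})

# The transpose is used only to build a boolean mask; True marks a column
# whose values repeat an earlier column.
is_dup = df.T.duplicated().to_numpy()

# Selecting from the original frame leaves each kept column's dtype untouched.
deduped = df.loc[:, ~is_dup]
print(list(deduped.columns))
print(deduped.dtypes)
```

Here C repeats A, so only A, B and D should remain, and no .astype round trip is needed.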
Ch3steR answered Jan 03 '23 09:01
