I have a DataFrame in which some columns hold duplicate data under different names:
In[1]: df
Out[1]:
X1 X2 Y1 Y2
0.0 0.0 6.0 6.0
3.0 3.0 7.1 7.1
7.6 7.6 1.2 1.2
I know .drop(columns=...) exists, but is there a more efficient way to drop these duplicates without having to list the column names? If not, please let me know and I'll just use .drop().
We can use np.unique over axis 1. Unfortunately, there's no pandas built-in function to drop duplicate columns. df.drop_duplicates only removes duplicate rows; from its docs:
"Return DataFrame with duplicate rows removed."
We can create a function around np.unique to drop duplicate columns.
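import numpy as np
import pandas as pd

def drop_duplicate_cols(df):
    # axis=1 compares whole columns; idxs holds the position of each unique column's first occurrence
    uniq, idxs = np.unique(df, return_index=True, axis=1)
    return pd.DataFrame(uniq, index=df.index, columns=df.columns[idxs])

drop_duplicate_cols(df)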
X1 Y1
0 0.0 6.0
1 3.0 7.1
2 7.6 1.2
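NB: From the np.unique docs: "Returns the sorted unique elements of an array." This means the surviving columns come back in sorted order, not in their original order.
Workaround: To retain the original column order, sort idxs before selecting the columns.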
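The order-preserving workaround could be sketched as follows. The function name drop_duplicate_cols_keep_order and the sample frame are made up for illustration, and the sketch assumes all columns are numeric, since np.unique with axis=1 does not handle object arrays. It selects columns with iloc rather than rebuilding the frame from the sorted array.

```python
import numpy as np
import pandas as pd

def drop_duplicate_cols_keep_order(df):
    # Hypothetical helper: find the first occurrence of each unique column...
    _, idxs = np.unique(df.to_numpy(), return_index=True, axis=1)
    # ...then sort those positions so columns keep their original order.
    return df.iloc[:, np.sort(idxs)]

# Illustrative data: the Y columns come first, so sorted output would reorder them.
df = pd.DataFrame({
    "Y1": [6.0, 7.1, 1.2],
    "Y2": [6.0, 7.1, 1.2],
    "X1": [0.0, 3.0, 7.6],
    "X2": [0.0, 3.0, 7.6],
})
result = drop_duplicate_cols_keep_order(df)
print(result)
```

Because the columns are picked from the original frame by position, their names and dtypes carry over unchanged.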
Using .T on a DataFrame that has multiple dtypes is going to mess with your actual dtypes.
df = pd.DataFrame({'A': [0, 1], 'B': ['a', 'b'], 'C': [0, 1], 'D':[2.1, 3.1]})
df.dtypes
A int64
B object
C int64
D float64
dtype: object
df.T.T.dtypes
A object
B object
C object
D object
dtype: object
# To get back original `dtypes` we can use `.astype`
df.T.T.astype(df.dtypes).dtypes
A int64
B object
C int64
D float64
dtype: object
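For mixed-dtype frames, one possible alternative (not part of the approach above, just a sketch) is to use the transpose only to build a boolean mask and then select from the original frame, so the original dtypes are never touched. The variable names mask and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1], 'B': ['a', 'b'], 'C': [0, 1], 'D': [2.1, 3.1]})

# Rows of df.T are the columns of df, so duplicated() flags repeated columns.
mask = df.T.duplicated()

# Select from the original frame; df itself is never transposed back.
result = df.loc[:, ~mask]
print(result.dtypes)
```

Here column C duplicates A, so only A, B and D should remain, each with its original dtype. Transposing a large frame can still be expensive, so this is a convenience rather than a performance improvement.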