I know I can find duplicate columns using:
df.T.duplicated()
what I'd like to know the index that a duplicate column is a duplicate of. For example, both C
and D
are duplicates of a A
below:
df = pd.DataFrame([[1,0,1,1], [2,0,2,2]], columns=['A', 'B', 'C', 'D'])
A B C D
0 1 0 1 1
1 2 0 2 2
I'd like something like:
duplicate_index = pd.Series([None, None, 'A', 'A'], ['A', 'B', 'C', 'D'])
I don't know if duplicated
have an option to give information about the first row with the same data. My idea is by using groupby
and transform
such as:
arr_first = (df.T.reset_index().groupby([col for col in df.T.columns])['index']
.transform(lambda x: x.iloc[0]).values)
With your example, arr_first
is then equal to array(['A', 'B', 'A', 'A'], dtype=object)
and because they have the same order than df.columns
, to get the expected output, you use np.where
like:
duplicate_index = pd.Series(pd.np.where(arr_first != df.columns, arr_first, None),df.columns)
and the result for duplicate_index
is
A None
B None
C A
D A
dtype: object
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With