Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding the Location of the Duplicate for Duplicated Columns in Pandas

I know I can find duplicate columns using:

df.T.duplicated()

what I'd like to know the index that a duplicate column is a duplicate of. For example, both C and D are duplicates of a A below:

df = pd.DataFrame([[1,0,1,1], [2,0,2,2]], columns=['A', 'B', 'C', 'D'])

   A  B  C  D
0  1  0  1  1
1  2  0  2  2

I'd like something like:

duplicate_index = pd.Series([None, None, 'A', 'A'], ['A', 'B', 'C', 'D'])
like image 924
rhaskett Avatar asked Mar 05 '23 21:03

rhaskett


1 Answers

I don't know if duplicated have an option to give information about the first row with the same data. My idea is by using groupby and transform such as:

arr_first = (df.T.reset_index().groupby([col for col in df.T.columns])['index']
                .transform(lambda x: x.iloc[0]).values)

With your example, arr_first is then equal to array(['A', 'B', 'A', 'A'], dtype=object) and because they have the same order than df.columns, to get the expected output, you use np.where like:

duplicate_index = pd.Series(pd.np.where(arr_first != df.columns, arr_first, None),df.columns)

and the result for duplicate_index is

A    None
B    None
C       A
D       A
dtype: object
like image 52
Ben.T Avatar answered Apr 26 '23 11:04

Ben.T