How to get index for all the duplicates in a dataframe (pandas - python)

Tags: python, pandas

I have a data frame with multiple columns, and I want to find the duplicates in some of them. My columns go from A to Z. I want to know which rows have the same values in columns A, D, F, K, L, and P.

I tried:

df = df[df.duplicated(keep=False)]
df = df.groupby(df.columns.tolist()).apply(lambda x: tuple(x.index)).tolist()

However, this uses all of the columns.

I also tried

print(df[df.duplicated(['A', 'D', 'F', 'K', 'L', 'P'])])

This returns only the later occurrences of each duplicate, not the first one. I want the indices of all rows that share the same values.

Ally asked Dec 02 '25 15:12


2 Answers

Your first attempt is close. Instead of grouping by all columns, group by a list of only the columns you want to consider:

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [3, 3, 3, 4, 4, 5],
                   'C': [6, 7, 8, 9, 10, 11]})

res = df.groupby(['A', 'B']).apply(lambda x: (x.index).tolist()).reset_index()

print(res)

#    A  B          0
# 0  1  3  [0, 1, 2]
# 1  2  4     [3, 4]
# 2  2  5        [5]
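
The result above also lists groups with a single row, such as [5]. If only rows that actually have a duplicate are wanted, one option is to filter with duplicated(keep=False) before grouping. This is a minimal sketch on the same example frame; the names cols, dupes, and groups are illustrative and not from the question:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [3, 3, 3, 4, 4, 5],
                   'C': [6, 7, 8, 9, 10, 11]})

cols = ['A', 'B']  # columns that define a duplicate (illustrative choice)

# keep=False marks every row of a duplicated group, including the first one
dupes = df[df.duplicated(subset=cols, keep=False)]

# Collect the original index labels of each duplicated group
groups = dupes.index.to_series().groupby([dupes[c] for c in cols]).apply(list)

print(groups.tolist())
# [[0, 1, 2], [3, 4]]
```

For the frame in the question, cols would be the list of columns to compare instead of ['A', 'B'].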
jpp answered Dec 05 '25 05:12


A different layout, grouping the index itself by the columns of interest:

df.index.to_series().groupby([df['A'],df['B']]).apply(list)
Out[449]: 
A  B
1  3    [0, 1, 2]
2  4       [3, 4]
   5          [5]
dtype: object
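
If a plain dictionary is more convenient than a Series, the groups attribute of a groupby object maps each key to the index labels of its rows. The following is only a sketch on the same example frame; dup_index is a hypothetical name, and the length filter is an added assumption that single-row groups are not wanted:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [3, 3, 3, 4, 4, 5],
                   'C': [6, 7, 8, 9, 10, 11]})

# .groups maps each (A, B) key to the Index of row labels in that group
all_groups = df.groupby(['A', 'B']).groups

# Keep only keys that occur more than once, as plain Python lists
dup_index = {key: list(idx) for key, idx in all_groups.items() if len(idx) > 1}

for key, idx in dup_index.items():
    print(key, idx)
```

Here the keys are (A, B) tuples and the values are the index labels of the rows sharing those values.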
BENY answered Dec 05 '25 05:12