I have a data frame with multiple columns, and I want to find the duplicates in some of them. My columns go from A to Z. I want to know which lines have the same values in columns A, D, F, K, L, and G.
I tried:
df = df[df.duplicated(keep=False)]
df = df.groupby(df.columns.tolist()).apply(lambda x: tuple(x.index)).tolist()
However, this uses all of the columns.
I also tried
print(df[df.duplicated(['A', 'D', 'F', 'K', 'L', 'P'])])
This only returns the duplication's index. I want the index of both lines that have the same values.
Your final attempt is close. Instead of grouping by all columns, just use a list of the ones you want to consider:
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
'B': [3, 3, 3, 4, 4, 5],
'C': [6, 7, 8, 9, 10, 11]})
res = df.groupby(['A', 'B']).apply(lambda x: (x.index).tolist()).reset_index()
print(res)
# A B 0
# 0 1 3 [0, 1, 2]
# 1 2 4 [3, 4]
# 2 2 5 [5]
Different layout of groupby
df.index.to_series().groupby([df['A'],df['B']]).apply(list)
Out[449]:
A B
1 3 [0, 1, 2]
2 4 [3, 4]
5 [5]
dtype: object
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With