In a dataframe with 2 columns, [id] and [string], I need to know which rows are duplicates of which other rows, based on the value of the column [string].
My dataframe has thousands of rows but only 2 columns.
Sample of the input dataframe:
id,string
0,"A B C D"
1,"D B C D E Z"
2,"A B C D"
3,"Z Z Z Z Z Z Z Z Z Z Z Z"
4,"D B C D E Z"
5,"A B C D"
In this sample, rows 0, 2, 5 are duplicates of each other. Also rows 1 and 4 are duplicates of each other. (id is unique)
I want the following output:
[["0","2","5"],["1","4"]]
You can aggregate the ids into one list per string, then keep only the lists with more than one element by boolean indexing on Series.str.len:
s = df.assign(id = df['id'].astype(str)).groupby('string')['id'].apply(list)
out = s[s.str.len().gt(1)].tolist()
If the id values are already strings:
s = df.groupby('string')['id'].apply(list)
out = s[s.str.len().gt(1)].tolist()
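For a quick check, here is a minimal, self-contained sketch that rebuilds the sample from the question and runs the first variant. The names csv_text, dupes, and out_alt are only for illustration and are not part of the original answer. The second half shows one possible alternative, assuming you would rather drop the unique rows first with DataFrame.duplicated(keep=False) and keep groups in order of first appearance (sort=False) instead of sorted by string.

```python
import io

import pandas as pd

# Sample data copied from the question (csv_text is an illustrative name).
csv_text = '''id,string
0,"A B C D"
1,"D B C D E Z"
2,"A B C D"
3,"Z Z Z Z Z Z Z Z Z Z Z Z"
4,"D B C D E Z"
5,"A B C D"
'''
df = pd.read_csv(io.StringIO(csv_text))

# Approach from the answer: one list of ids per distinct string,
# then keep only the lists that have more than one element.
s = df.assign(id=df['id'].astype(str)).groupby('string')['id'].apply(list)
out = s[s.str.len().gt(1)].tolist()
print(out)

# Alternative sketch (an assumption, not from the answer): filter the
# duplicated rows first, then group them in order of first appearance.
dupes = df[df.duplicated('string', keep=False)]
out_alt = (
    dupes['id'].astype(str)
    .groupby(dupes['string'], sort=False)
    .apply(list)
    .tolist()
)
print(out_alt)
```

On this sample both variants give the same groups. They can differ in group order on other data, because groupby sorts by the string value by default while sort=False keeps the order in which each string first appears.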