Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

find duplicates of items endings in a list

Tags:

python

list

I would like to find items in a list that have duplicate endings within the last 3 characters of the string

I know how to find duplicates using the code below, but need help with code how to find that the last strings of "sara" and "tamara" are the same so that one of the items can be copied to a duplicate_finding list

using the following code, I can only identifiy exact duplicate items of a list:

names = ["tom", "john", "sara" , "tamara" , "tom"]
single_finds = []
duplicate_finds = []

for i in names:
        if i in single_finds:
                duplicate_finds.append(i)
        else:
                single_finds.append(i)

print (single_finds)
print (duplicate_finds)

OUTPUT:

['tom', 'john', 'sara', 'tamara']
['tom']
like image 710
sebastian20182018 Avatar asked May 24 '19 15:05

sebastian20182018


2 Answers

One approach would be to use itertools.groupby, specifying that we want to group based on the last n characters using the key argument.

Then we can flatten the list removing those sublists with only 1 item using itertools.chain and take a set to remove duplicates (or a list if you want them):

from itertools import groupby, chain
k = lambda x: x[-3:]
l = [list(v) for _,v in groupby(sorted(names, key=k), key=k)]
# [['tamara', 'sara'], ['john'], ['tom', 'tom']]
[i[0] for i in l if len(i) > 1]
# ['tamara', 'tom']
like image 76
yatu Avatar answered Oct 13 '22 02:10

yatu


Accumulate names per-suffix using a dict, and then gather the results:

>>> from collections import defaultdict 
>>> d = defaultdict(list) 
>>> for name in names: 
...     suffix = name[-3:] 
...     d[suffix].append(name) 
... 
>>> for suffix, names in d.items(): 
...     print("-", suffix, ":", *names) 
... 
- tom : tom tom
- ohn : john
- ara : sara tamara

You can partition d.items() into singles and dupes by looking at the len(names) now.

This is an O(n) time-complexity solution, as opposed to groupby-based approaches which require pre-sorting the data at O(n log n).

like image 28
wim Avatar answered Oct 13 '22 00:10

wim