I would like to find items in a list that have duplicate endings within the last 3 characters of the string
I know how to find duplicates using the code below, but need help with code how to find that the last strings of "sara" and "tamara" are the same so that one of the items can be copied to a duplicate_finding list
using the following code, I can only identifiy exact duplicate items of a list:
names = ["tom", "john", "sara" , "tamara" , "tom"]
single_finds = []
duplicate_finds = []
for i in names:
if i in single_finds:
duplicate_finds.append(i)
else:
single_finds.append(i)
print (single_finds)
print (duplicate_finds)
OUTPUT:
['tom', 'john', 'sara', 'tamara']
['tom']
One approach would be to use itertools.groupby
, specifying that we want to group based on the last n
characters using the key
argument.
Then we can flatten the list removing those sublists with only 1
item using itertools.chain
and take a set to remove duplicates (or a list
if you want them):
from itertools import groupby, chain
k = lambda x: x[-3:]
l = [list(v) for _,v in groupby(sorted(names, key=k), key=k)]
# [['tamara', 'sara'], ['john'], ['tom', 'tom']]
[i[0] for i in l if len(i) > 1]
# ['tamara', 'tom']
Accumulate names per-suffix using a dict, and then gather the results:
>>> from collections import defaultdict
>>> d = defaultdict(list)
>>> for name in names:
... suffix = name[-3:]
... d[suffix].append(name)
...
>>> for suffix, names in d.items():
... print("-", suffix, ":", *names)
...
- tom : tom tom
- ohn : john
- ara : sara tamara
You can partition d.items()
into singles and dupes by looking at the len(names)
now.
This is an O(n) time-complexity solution, as opposed to groupby-based approaches which require pre-sorting the data at O(n log n).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With