I have a container cont. If I want to find out if it has duplicates, I'll just check len(cont) == len(set(cont)).

Suppose I want to find a duplicate element if it exists (just any arbitrary duplicate element). Is there any neat and efficient way to write that?
[Python 3]
You can start adding the elements to a set, and as soon as you try to add an element that is already in the set, you've found a duplicate.
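A minimal sketch of that idea (the name first_duplicate is mine, just for illustration; it assumes the elements are hashable):

def first_duplicate(cont):
    # Track elements already seen; the first repeat is a duplicate.
    seen = set()
    for item in cont:
        if item in seen:
            return item
        seen.add(item)
    return None  # no duplicate exists

Because it returns as soon as a repeat shows up, it does no more work than necessary.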
Ok, my first answer has gotten quite a bit of flak, so I thought I'd try a few different ways of doing this and report the differences. Here's my code.
import sys
import itertools

def getFirstDup(c, toTest):
    # Original idea using list slicing => 5.014 s
    if toTest == '1':
        for i in range(0, len(c)):
            if c[i] in c[:i]:
                return c[i]
    # Using two sets => 4.305 s
    elif toTest == '2':
        s = set()
        for i in c:
            s2 = s.copy()
            s.add(i)
            if len(s) == len(s2):
                return i
    # Using dictionary LUT => 0.763 s
    elif toTest == '3':
        d = {}
        for i in c:
            if i in d:
                return i
            else:
                d[i] = 1
    # Using set operations => 0.772 s
    elif toTest == '4':
        s = set()
        for i in c:
            if i in s:
                return i
            else:
                s.add(i)
    # Sorting then walking => 5.130 s
    elif toTest == '5':
        c = sorted(c)
        for i in range(1, len(c)):
            if c[i] == c[i - 1]:
                return c[i]
    # Sorting then groupby-ing => 5.086 s
    else:
        c = sorted(c)
        for k, g in itertools.groupby(c):
            if len(list(g)) > 1:
                return k
    return None

c = list(range(0, 10000000))
c[5000] = 0
for i in range(0, 10):
    print(getFirstDup(c, sys.argv[1]))
Basically, I try this in six different ways, as listed in the source file. I used the Linux time command and collected the real-time runtimes, running the commands like so:

time python ./test.py 1

with 1 being the algorithm I wanted to try. Each algorithm looks for the first duplicate in 10,000,000 integers and runs ten times. There is one duplicate in the list, which is "mostly sorted"; I did try reverse-sorted lists too, without noticing a proportional difference between algorithms.
My original suggestion (list slicing) did poorly at 5.014 s. My understanding of icyrock.com's solution (the two-set version) also did poorly at 4.305 s. Next I tried using a dictionary as a LUT, which gave the best runtime at 0.763 s. I tried using the in operator on a set and got 0.772 s, nearly as good as the dictionary LUT. I tried sorting and walking the list, which gave a pitiful time of 5.130 s. Finally, I tried John Machin's itertools suggestion, which gave a poor time of 5.086 s.

In summary, a dictionary LUT seems to be the way to go, with set operations (which are also backed by hash tables, so the similar time is unsurprising) a close second.
Update: I tried razpeitia's suggestion, and apart from the fact that you need to know precisely which duplicate key you're looking for, the actual algorithm did the worst so far (66.366 s).
Update 2: I'm sure someone is going to say that this test is biased because the duplicate is located near one end of the list. Try running the code with the duplicate at a different location before downvoting, and report your results!
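For example, to move the second occurrence toward the far end of the test list, you could change the setup lines to something like this (the index 9995000 is just an illustrative choice):

c = list(range(0, 10000000))
c[9995000] = 0  # duplicates the value at c[0], now near the end of the list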
It's not apparent what the point is of finding one arbitrary element that is a duplicate of 1 or more other elements of the collection ... do you want to remove it? Merge its attributes with those of its twins / triplets / ... / N-tuplets? In any case, that's an O(N) operation, which, if repeated until no more duplicates are detected, is an O(N ** 2) operation.

However, you can get a bulk deal at the algorithm warehouse: sort the collection -- O(N*log(N)) -- and then use itertools.groupby to bunch up the duplicates and cruise through the bunches, ignoring the bunches of size 1 and doing whatever you want with the bunches of size > 1 -- all of that is only about O(N).
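A short sketch of that sort-then-groupby approach (duplicate_bunches is my name for it, not the answerer's; it assumes the elements are orderable):

import itertools

def duplicate_bunches(cont):
    # Sorting brings equal elements together; groupby then yields one
    # (key, group) pair per run of equal elements.
    for key, group in itertools.groupby(sorted(cont)):
        bunch = list(group)
        if len(bunch) > 1:
            yield key, bunch  # key occurs len(bunch) times

Each bunch of size > 1 is a set of duplicates you can remove, merge, or otherwise process in a single pass over the sorted data.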
from collections import Counter

cont = [1, 2, 3]
c = Counter(cont)
x = someItem  # placeholder: the element you want to check
if c[x] == 0:
    print("Not in cont")
elif c[x] == 1:
    print("Unique")
else:
    print("Duplicate")