
Python: find a duplicate in a container efficiently

I have a container cont. If I want to find out if it has duplicates, I'll just check len(cont) == len(set(cont)).
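
For concreteness, that whole-container check (it assumes the elements are hashable):

cont = [1, 2, 3, 2, 1]
# set() collapses duplicates, so the lengths differ exactly when a duplicate exists
print(len(cont) == len(set(cont)))  # False, because cont has duplicates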

Suppose I want to find a duplicate element if it exists (just any arbitrary duplicate element). Is there any neat and efficient way to write that?

[Python 3]

asked Nov 16 '10 by max


4 Answers

You can start adding the elements to a set; as soon as you try to add an element that is already in the set, you have found a duplicate.
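
A minimal sketch of that idea (first_duplicate is just an illustrative name; the elements must be hashable):

def first_duplicate(cont):
    seen = set()
    for item in cont:
        if item in seen:
            return item  # this element repeats one seen earlier
        seen.add(item)
    return None  # no duplicates found

print(first_duplicate([1, 2, 3, 2, 1]))  # -> 2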

answered by icyrock.com


Ok, my first answer has gotten quite a bit of flak, so I thought I'd try a few different ways of doing this and report the differences. Here's my code.

import sys
import itertools

def getFirstDup(c, toTest):

    # Original idea using list slicing => 5.014 s
    if toTest == '1':
        for i in range(len(c)):
            if c[i] in c[:i]:
                return c[i]

    # Using two sets (note the per-iteration copy) => 4.305 s
    elif toTest == '2':
        s = set()
        for i in c:
            s2 = s.copy()
            s.add(i)
            if len(s) == len(s2):
                return i

    # Using dictionary LUT => 0.763 s
    elif toTest == '3':
        d = {}
        for i in c:
            if i in d:
                return i
            else:
                d[i] = 1

    # Using set operations => 0.772 s
    elif toTest == '4':
        s = set()
        for i in c:
            if i in s:
                return i
            else:
                s.add(i)

    # Sorting then walking => 5.130 s
    elif toTest == '5':
        c = sorted(c)
        for i in range(1, len(c)):
            if c[i] == c[i - 1]:
                return c[i]

    # Sorting then groupby-ing => 5.086 s
    else:
        c = sorted(c)
        for k, g in itertools.groupby(c):
            if len(list(g)) > 1:
                return k

    return None


c = list(range(10000000))
c[5000] = 0

for i in range(10):
    print(getFirstDup(c, sys.argv[1]))

Basically, I try this in six different ways, as listed in the source file. I used the Linux time command and collected the real (wall-clock) runtimes, running the commands like so:

time python ./test.py 1

where 1 selects the algorithm to test. Each algorithm looks for the first duplicate among 10,000,000 integers and runs ten times. There is one duplicate in the list, which is "mostly sorted", though I did try reverse-sorted lists without noticing a proportional difference between the algorithms.

My original suggestion did poorly at 5.014 s. My understanding of icyrock.com's solution also did poorly at 4.305 s. Next I tried using a dictionary as a LUT, which gave the best runtime at 0.763 s. I tried using the in operator on sets and got 0.772 s, nearly as good as the dictionary LUT. Sorting and then walking the list gave a pitiful time of 5.130 s. Finally, I tried John Machin's itertools suggestion, which gave a poor time of 5.086 s.

In summary, a dictionary LUT seems to be the way to go, with set operations (which may use LUTs in their implementation) being a close second.


Update: I tried razpeitia's suggestion, and apart from the fact that you need to know precisely what duplicate key you're looking for, the actual algorithm did the worst so far (66.366 s).


Update 2: I'm sure someone is going to say that this test is biased because the duplicate is located near one end of the list. Try running the code with the duplicate in a different position before downvoting, and report your results!
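
For example, to move the lone duplicate near the far end of the test list instead (a hypothetical variation of the setup above):

c = list(range(10000000))
c[9999000] = 0  # the duplicate of 0 now sits near the end rather than near the start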

answered by Zeke


It's not apparent what the point is of finding one arbitrary element that is a duplicate of one or more other elements of the collection ... do you want to remove it? Merge its attributes with those of its twins / triplets / ... / N-tuplets? In any case, that's an O(N) operation, which, if repeated until no more duplicates are detected, is an O(N ** 2) operation.

However, you can get a bulk deal at the algorithm warehouse: sort the collection -- O(N*log(N)) -- and then use itertools.groupby to bunch up the duplicates and cruise through the bunches, ignoring the bunches of size 1 and doing whatever you want with the bunches of size > 1 -- all of that (after the sort) is only about O(N).
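
A sketch of that bulk approach; the body of the loop is a placeholder for whatever you want to do with each bunch:

import itertools

def process_duplicates(cont):
    # sorted() brings equal elements together: O(N*log(N))
    for key, group in itertools.groupby(sorted(cont)):
        bunch = list(group)
        if len(bunch) > 1:
            # every bunch of size > 1 is a family of duplicates
            print(key, "occurs", len(bunch), "times")

process_duplicates([3, 1, 2, 1, 3, 3])  # 1 occurs 2 times; 3 occurs 3 times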

answered by John Machin


from collections import Counter

cont = [1, 2, 3]
c = Counter(cont)
x = 2  # the item whose multiplicity we want to check

if c[x] == 0:
    print("Not in cont")
elif c[x] == 1:
    print("Unique")
else:
    print("Duplicate")

answered by razpeitia