Efficiently find repeated characters in a string

Tags:

performance

python

string

I know that the efficiency of this code is not optimal (esp. with gigantic inputs), and I know that there is a way to change this algorithm to handle other data types and not just a repetition in a string (obviously there are only so many characters to search through).

Is there any way I can increase efficiency here?

I tried using a dictionary and the function kept returning 'none' so I tried a list and things worked out fine.

Thanks ahead of time to anyone who can help me out!

def find_repeater(string):
    my_list = []
    my_list.append(string[0])

    for i in range (1, len(string)):

        if string[i] in my_list:
            print 'repetition found'
            return (string[i])

        else:
            my_list.append(string[i])

print find_repeater('abca')

Now with a dictionary (it keeps printing 'None' to the console):

def find_repeater(string):
    my_dict = {}
    my_dict[0] = string[0]

    for i in range (1, len(string)):

        if string[i] in my_dict:
            print 'repetition found'
            return string[i]

        else:
            my_dict[i] = string[i]

print find_repeater('abca')

775

asked Sep 07 '14 00:09

ChipSkylark

1 Answer

As this is a performance question, let's do some timings:

def test_set(xs):
    seen = set()  # O(1) lookups
    for x in xs:
        if x not in seen:
            seen.add(x)
        else:
            return x

import collections

def test_counter(xs):
    freq = collections.Counter(xs)
    for k in freq:
        if freq[k] > 1:
            return k

def test_dict(xs):
    d = {}
    for x in xs:
        if x in d:
            return x
        d[x] = 1

def test_sort(xs):
    ys = sorted(xs)

    for n in range(1, len(xs)):
        if ys[n] == ys[n-1]:
            return ys[n]

##

import sys, timeit
print (sys.version + "\n")
xs = list(range(10000)) + [999]
fns = [p for name, p in globals().items() if name.startswith('test')]
for fn in fns:
    assert fn(xs) == 999
    print ('%50s %.5f' % (fn, timeit.timeit(lambda: fn(xs), number=100)))

I'm testing on a list of integers rather than a string, because a string of single-byte characters has at most 256 distinct values, so a repeat always shows up within the first 257 characters. The results on my machine look like this:

3.2.3 (v3.2.3:3d0686d90f55, Apr 10 2012, 11:25:50) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]

                <function test_set at 0x1020f7380> 0.19265
               <function test_dict at 0x1020f7490> 0.12725
               <function test_sort at 0x1020f7518> 0.04683
            <function test_counter at 0x1020f7408> 0.92485

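So the sort method appears to be the winner. I guess this is because it doesn't waste time creating hashes and allocating dict/set structures. Keep in mind that this test input is already almost sorted, which is close to the best case for Python's sort, so shuffled data may narrow or reverse the gap. Also, if you don't care about the source list being changed, you can do xs.sort() instead of ys = sorted(xs), which avoids allocating a copy of the list.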
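A minimal sketch of the in-place variant mentioned above. The name find_repeat_inplace is mine, not from the question, and the function deliberately mutates the list it is given:

```python
def find_repeat_inplace(xs):
    """Return some repeated item of the list xs, or None if all items are unique.

    Assumption: the caller does not mind xs being reordered.
    """
    xs.sort()  # sorts the caller's list; no sorted copy is created
    for n in range(1, len(xs)):
        # After sorting, equal items sit next to each other.
        if xs[n] == xs[n - 1]:
            return xs[n]
    return None


data = [5, 3, 8, 3, 1]
result = find_repeat_inplace(data)
print(result)  # 3
print(data)    # [1, 3, 3, 5, 8] (the input list was reordered)
```

This only works on lists (strings and tuples have no sort method), and the items must be comparable with each other.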

On the other hand, if repeated items are more likely to occur towards the beginning of the input (as in xs = 'abcdef' * 10000), the set method will perform best, because unlike sort or Counter it returns as soon as a repeat is found and doesn't need to preprocess the whole input. You should also use set if you need the first repeating element, not just any one of them.
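To make the "first repeating element" point concrete, here is a small sketch comparing the two approaches on the same input. The function names first_repeat and any_repeat_sorted are illustrative; the bodies follow test_set and test_sort above:

```python
def first_repeat(xs):
    """Return the first item that is seen a second time, scanning left to right."""
    seen = set()
    for x in xs:
        if x in seen:
            return x  # stops here; the rest of xs is never examined
        seen.add(x)
    return None


def any_repeat_sorted(xs):
    """Return the smallest repeated item (a side effect of sorting first)."""
    ys = sorted(xs)
    for n in range(1, len(ys)):
        if ys[n] == ys[n - 1]:
            return ys[n]
    return None


xs = [3, 1, 3, 1]
print(first_repeat(xs))       # 3 (3 is the first value to come around again)
print(any_repeat_sorted(xs))  # 1 (sorting puts the 1s first)
```

Both answers are valid repeats, but only the set version respects the original order. It also accepts any iterable of hashable items, including a generator, since it never needs the length or an index.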

Counter is a nice tool, but it's not designed for performance, so if you really have to deal with "gigantic inputs", go with sets if the data fits in memory, or with a merge sort (which can work through the data in chunks) if it doesn't.
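As for why the dictionary version in the question prints None: the in operator on a dict checks the keys, and that code stores the characters as values under integer keys, so the test never succeeds and the function falls off the end. A short sketch of the difference, with a corrected version that uses the characters themselves as keys (same function name as in the question, rewritten for Python 3):

```python
d = {0: 'a', 1: 'b', 2: 'c'}  # index -> character, as in the question
print('a' in d)           # False: 'a' is a value, not a key
print(0 in d)             # True: membership is tested against the keys
print('a' in d.values())  # True, but this is a linear scan, not a hash lookup


def find_repeater(string):
    """Return the first repeated character, or None if there is none."""
    seen = {}
    for ch in string:
        if ch in seen:   # key lookup, O(1) on average
            return ch
        seen[ch] = True  # the character is the key; the value is a placeholder
    return None


print(find_repeater('abca'))  # a
```

Since the values are never used, a set expresses the same idea more directly, which is what test_set above does.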

182

answered Oct 20 '22 06:10

georg
