I need to filter strings by the criterion that they contain no character twice.
The calling code would look like this:
>>> candidate_strings = ["foobnehg", "barfnehg", "bazfnehg"]
>>> result_strings = [s for s in candidate_strings if unique_chars(s)]
>>> print(result_strings)
['barfnehg', 'bazfnehg']
I implemented a naive version, simply iterating the string:
def unique_chars_naive(string_given):
    """
    Checks if a given string contains only unique characters.
    This version iterates the given string, remembering every character seen so far.
    """
    chars_seen = []
    for char in string_given:
        if char in chars_seen:
            return False
        chars_seen.append(char)
    return True
My next-best idea was to use a set, so I implemented that:
def unique_chars_set(string_given):
    """
    Checks if a given string contains only unique characters.
    This version exploits that a set contains only unique entries.
    """
    return len(string_given) == len(set(string_given))
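For reference, on the sample strings from above it behaves as expected:
>>> unique_chars_set("foobnehg")
False
>>> unique_chars_set("barfnehg")
True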
After saving the functions to a file UniqueCharacters.py, I timed them:
$ python3 -m timeit -n 100000 --setup='import UniqueCharacters; candidate_strings = ["foobnehg", "barfnehg", "bazfnehg"]' '[UniqueCharacters.unique_chars_naive(s) for s in candidate_strings]'
100000 loops, best of 3: 20.3 usec per loop
$ python3 -m timeit -n 100000 --setup='import UniqueCharacters; candidate_strings = ["foobnehg", "barfnehg", "bazfnehg"]' '[UniqueCharacters.unique_chars_set(s) for s in candidate_strings]'
100000 loops, best of 3: 17.7 usec per loop
This shows that unique_chars_set is faster by about 15 % for this dataset.
Is there a faster way to do this? With regular expressions maybe? Is there some method in the standard library that does this?
Let me start off by saying that I suspect that you are optimizing when you don't need to. Python is a high-level language that supports thinking about computation in a high-level manner. A solution that is readable, elegant, and reusable is often going to be better than one that is blazingly fast, but hard to understand.
When, and only when, you determine that speed is an issue, then you should proceed with the optimizations. Perhaps even write a C extension for the computationally intense parts.
That being said, here's a comparison of a few techniques:
def unique_chars_set(s):
    return len(s) == len(set(s))

def unique_chars_frozenset(s):
    return len(s) == len(frozenset(s))
from collections import Counter

def unique_chars_counter(s):
    # Assumes a non-empty string; unique iff the most common
    # character occurs exactly once.
    return Counter(s).most_common(1)[0][1] == 1
def unique_chars_sort(s):
    ss = ''.join(sorted(s))
    prev = ''
    for c in ss:
        if c == prev:
            return False
        prev = c
    return True
def unique_chars_bucket(s):
    # Assumes ord(c) < 256, i.e. ASCII or Latin-1 input.
    buckets = 256 * [False]
    for c in s:
        o = ord(c)
        if buckets[o]:
            return False
        buckets[o] = True
    return True
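One more variant worth sketching, not included in the timings below: an early-exit check that combines the naive loop's short-circuiting with O(1) set lookups (the name unique_chars_seen_set is mine):
def unique_chars_seen_set(s):
    # Hypothetical variant: stops at the first repeated character
    # instead of always building a set of the whole string.
    seen = set()
    for c in s:
        if c in seen:
            return False
        seen.add(c)
    return True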
And here is the performance comparison (in IPython):
In [0]: %timeit -r10 [unique_chars_set(s) for s in candidate_strings]
100000 loops, best of 10: 6.63 us per loop
In [1]: %timeit -r10 [unique_chars_frozenset(s) for s in candidate_strings]
100000 loops, best of 10: 6.81 us per loop
In [2]: %timeit -r10 [unique_chars_counter(s) for s in candidate_strings]
10000 loops, best of 10: 83.1 us per loop
In [3]: %timeit -r10 [unique_chars_sort(s) for s in candidate_strings]
100000 loops, best of 10: 13.1 us per loop
In [4]: %timeit -r10 [unique_chars_bucket(s) for s in candidate_strings]
100000 loops, best of 10: 15 us per loop
Conclusion: set is elegant and faster than many other obvious methods. But the differences are so small that it hardly matters anyway.
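As for the regular-expression idea from the question: a backreference can detect a repeated character, though I would not expect it to beat set on short strings, since backtracking makes it worst-case quadratic. A sketch (the name unique_chars_regex is mine):
import re

def unique_chars_regex(s):
    # Hypothetical sketch: a character captured by (.) that appears
    # again later (\1) means s has a duplicate; re.DOTALL lets '.'
    # match newlines as well.
    return re.search(r'(.).*?\1', s, re.DOTALL) is None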
For more benchmarks, see @FrancisAvila's answer.