Efficient method of checking if a string starts with one of a set of strings

Tags: python

I have a need to match a string to see if it starts with one of about 40 strings, and this method gets called a heck of a lot.

Currently it does

for pref, newval in list_of_prefixes:
    if oldval.startswith(pref):
        return newval
return oldval

However, given it's called a huge number of times, it makes sense for this to be as efficient as possible. I could ensure list_of_prefixes is sorted, and then drop out of the loop as soon as pref > oldval, but that doesn't seem to gain a great deal.
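
As a rough sketch of that sorted early-exit idea (the function name replace_prefix_sorted and the sample data are made up for illustration): any prefix of a string compares less than or equal to that string, so once pref > oldval none of the remaining sorted entries can match.

```python
# Hypothetical sample data: (prefix, replacement) pairs, sorted by prefix.
list_of_prefixes = sorted([('foo', 'nu'), ('bar', 'beer'), ('baz', 'base')])


def replace_prefix_sorted(oldval):
    """Return the replacement for the first matching prefix, else oldval."""
    for pref, newval in list_of_prefixes:
        if pref > oldval:
            # Every prefix of oldval sorts at or before oldval, so none of
            # the remaining (larger) entries can match.
            break
        if oldval.startswith(pref):
            return newval
    return oldval
```

Note that when one prefix is itself a prefix of another, sorting changes which of the two is tried first compared with the original list order.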

Currently the largest number of input values fall somewhere between two of the prefixes, so I could explicitly test for that, or search in reverse order, but although this is efficient for the data set now, it might be less efficient should the data set change.

Originally, there was only 1 possible prefix, so performance was perhaps less of an issue.

I looked at str.startswith() with a tuple of prefixes, but that seems only to make the code easier to write: it doesn't tell me which element of the tuple matched, so where there is a match I have to do the check twice.
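
To make the double check concrete, here is a minimal sketch of the tuple approach. The names prefix_tuple and replace_prefix_tuple and the sample data are hypothetical; the point is only that the tuple form of startswith returns True or False, so a second pass is needed to find the replacement.

```python
# Hypothetical sample data: (prefix, replacement) pairs.
list_of_prefixes = [('foo', 'nu'), ('bar', 'beer'), ('baz', 'base')]
# str.startswith accepts a tuple of candidate prefixes.
prefix_tuple = tuple(pref for pref, _ in list_of_prefixes)


def replace_prefix_tuple(oldval):
    # Fast rejection: one call answers "does any prefix match?" ...
    if not oldval.startswith(prefix_tuple):
        return oldval
    # ... but it does not say which one, so the match is repeated here.
    for pref, newval in list_of_prefixes:
        if oldval.startswith(pref):
            return newval
    return oldval
```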

asked Feb 10 '16 12:02

Tom Tanner

1 Answer

I would expect the overhead of compiling a regular expression to pay for itself once you have more than just a few strings. Basically, the compiled regex is an automaton which runs out of traversable paths pretty quickly if the prefix is not one which the automaton recognizes. Especially if all the matches are anchored to the beginning of the string, it should fail very quickly when there is no match.

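import re

prefixes = ['foo', 'bar', 'baz']
# re.escape keeps prefixes containing regex metacharacters literal
rx = re.compile(''.join(['^(?:', '|'.join(map(re.escape, prefixes)), ')']))
for line in lines:  # lines: any iterable of strings
    match = rx.match(line)
    if match:
        matched = match.group(0)  # the prefix that matched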

If you need a more complex regular expression (say, one with trailing context after the closing parenthesis), you will want to use regular grouping parentheses ( instead of non-grouping (?:, and fetch group(1) instead.
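
A minimal sketch of that variant, assuming (purely for illustration) that the trailing context is one or more digits right after the prefix:

```python
import re

prefixes = ['foo', 'bar', 'baz']
# Capturing group around the alternation, followed by trailing context
# (hypothetically, one or more digits right after the prefix).
rx = re.compile('(' + '|'.join(map(re.escape, prefixes)) + r')\d+')

match = rx.match('bar42 and more')
if match:
    matched_prefix = match.group(1)  # only the prefix captured by the group
    whole_match = match.group(0)     # the prefix plus the trailing digits
```

Here group(0) would no longer be usable as a dictionary key, which is why the capturing group is needed.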

Here is the same with a dictionary mapping prefixes to replacements:

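prefixes = {'foo': 'nu', 'bar': 'beer', 'baz': 'base'}
# iterating a dict yields its keys; re.escape keeps them literal
rx = re.compile(''.join(['^(?:', '|'.join(map(re.escape, prefixes)), ')']))
for line in lines:  # lines: any iterable of strings
    match = rx.match(line)
    if match:
        newval = prefixes[match.group(0)]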
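
Wrapped up as a drop-in replacement for the loop in the question, this might look like the sketch below. The function name replace_prefix is hypothetical, and trying longer prefixes first is an assumption on my part: it only matters when one key is a prefix of another, and the original loop would instead honor the list order.

```python
import re

prefixes = {'foo': 'nu', 'bar': 'beer', 'baz': 'base'}
# Longest prefixes first so that, if one key is a prefix of another,
# the more specific key wins; re.escape keeps metacharacters literal.
alternatives = sorted(prefixes, key=len, reverse=True)
rx = re.compile('|'.join(map(re.escape, alternatives)))


def replace_prefix(oldval):
    """Return the replacement for the matching prefix, or oldval itself."""
    match = rx.match(oldval)  # match() only tries position 0
    return prefixes[match.group(0)] if match else oldval
```

The pattern is compiled once at module level, so each call pays only for the match and one dictionary lookup.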

Actually, as pointed out in comments, the ^ is not strictly necessary with re.match(), which only ever matches at the beginning of the string.
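
A quick way to convince yourself, using made-up input strings: without the ^, match() still refuses to match in the middle of a string, whereas search() scans forward.

```python
import re

rx = re.compile('(?:foo|bar|baz)')  # no leading ^

at_start = rx.match('foobar')     # matches: 'foo' is at position 0
mid_string = rx.match('xfoobar')  # None: match() never scans forward
scanned = rx.search('xfoobar')    # search() does scan, so it finds 'foo'
```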

answered Sep 24 '22 15:09

tripleee
