I have a need to match a string to see if it starts with one of about 40 strings, and this method gets called a heck of a lot.
Currently it does
for pref, newval in list_of_prefixes:
if oldval.startswith(pref):
return newval
return oldval
However, given it's called a huge number of times, it makes sense for this to be efficient as possible. I could ensure list_of_prefixes is sorted, and then drop out of the loop as soon as pref > oldval, but that doesn't seem to gain a great deal.
Currently the largest number of input values fall somewhere between two of the prefixes, so I could explicitly test for that, or search in reverse order, but although this is efficient for the data set now, it might be less efficient should the data set change.
Originally, there was only 1 possible prefix, so performance was perhaps less of an issue.
I looked at string.startswith(tuple()) but that seems only to make it easier to write, and it doesn't tell me which tuple matched, so where there is a match I have to do the check twice.
The startsWith() method returns true if a string starts with a specified string. Otherwise it returns false . The startsWith() method is case sensitive. See also the endsWith() method.
startsWith() The startsWith() method determines whether a string begins with the characters of a specified string, returning true or false as appropriate.
The startswith() method returns True if the string starts with the specified value, otherwise False.
Java String startsWith() The Java String class startsWith() method checks if this string starts with the given prefix. It returns true if this string starts with the given prefix; else returns false.
With a compiled regular expression, I would expect the overhead of compiling a regex to pay itself back when you have more than just a few strings. Basically, the compiled regex is an automaton which runs out of traversable paths pretty quickly if the prefix is not one which the automaton recognizes. Especially if all the matches are anchored to the beginning of the string, it should fail very quickly when there is no match.
import re
prefixes = ['foo', 'bar', 'baz']
rx = re.compile(''.join(['^(?:', '|'.join(prefixes), ')']))
for line in input:
match = rx.match(line)
if match:
matched = match.group(0)
If you need a more complex regular expression (say, one with trailing context after the closing parenthesis), you will want to use regular grouping parentheses (
instead of non-grouping (?:
, and fetch group(1)
instead.
Here is the same with a dictionary mapping prefixes to replacements:
prefixes = {'foo': 'nu', 'bar': 'beer', 'baz': 'base'}
rx = re.compile(''.join(['^(?:', '|'.join(prefixes.keys()), ')']))
for line in input:
match = rx.match(line)
if match:
newval = prefixes[match.group(0)]
Actually, as pointed out in comments, the ^
is not strictly necessary with re.match()
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With