I'm using Python regexes in a criminally inefficient manner

Question

My goal here is to create a very simple template language. At the moment, I'm working on replacing a variable with a value, like this:

This input:

<%"TITLE"="This Is A Test Variable"%>The Web <%"TITLE"%>

Should produce this output:

The Web This Is A Test Variable

I've got it working. But looking at my code, I'm running multiple identical regexes on the same strings -- that just offends my sense of efficiency. There's got to be a better, more Pythonic way. (It's the two "while" loops that really offend.)

This does pass the unit tests, so if this is silly premature optimization, tell me -- I'm willing to let this go. There may be dozens of these variable definitions and uses in a document, but not hundreds. But I suspect there are obvious (to other people) ways of improving this, and I'm curious what the StackOverflow crowd will come up with.

import re

def stripMatchedQuotes(item):
    MatchedSingleQuotes = re.compile(r"'(.*)'", re.LOCALE)
    MatchedDoubleQuotes = re.compile(r'"(.*)"', re.LOCALE)
    item = MatchedSingleQuotes.sub(r'\1', item, 1)
    item = MatchedDoubleQuotes.sub(r'\1', item, 1)
    return item

def processVariables(item):
    VariableDefinition = re.compile(r'<%(.*?)=(.*?)%>', re.LOCALE)
    VariableUse = re.compile(r'<%(.*?)%>', re.LOCALE)
    Variables={}

    while VariableDefinition.search(item):
        VarName, VarDef = VariableDefinition.search(item).groups()
        VarName = stripMatchedQuotes(VarName).upper().strip()
        VarDef = stripMatchedQuotes(VarDef.strip())
        Variables[VarName] = VarDef
        item = VariableDefinition.sub('', item, 1)

    while VariableUse.search(item):
        VarName = stripMatchedQuotes(VariableUse.search(item).group(1).upper()).strip()
        item = VariableUse.sub(Variables[VarName], item, 1)

    return item

Brian · Accepted Answer

The first thing that may improve things is to move the re.compile calls outside the functions. Compiled patterns are cached, but every call still pays for the lookup that checks whether the pattern has already been compiled.
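As a rough sketch of that first step, the question's quote-stripping helper could look like this with the patterns hoisted to module level. The snake_case names are mine, not the asker's, and re.LOCALE is left out on the assumption that this runs on Python 3, which rejects that flag for str patterns.

```python
import re

# Compiled once at import time instead of on every call.
# Assumption: Python 3, so re.LOCALE (invalid for str patterns there) is dropped.
MATCHED_SINGLE_QUOTES = re.compile(r"'(.*)'")
MATCHED_DOUBLE_QUOTES = re.compile(r'"(.*)"')


def strip_matched_quotes(item):
    """Remove one pair of single quotes, then one pair of double quotes."""
    item = MATCHED_SINGLE_QUOTES.sub(r"\1", item, count=1)
    item = MATCHED_DOUBLE_QUOTES.sub(r"\1", item, count=1)
    return item
```

The behaviour is meant to be the same as the original helper; only the place where compilation happens changes.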

Another possibility is to use a single regex as below:

MatchedQuotes = re.compile(r"(['\"])(.*)\1", re.LOCALE)
item = MatchedQuotes.sub(r'\2', item, 1)
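Wrapped in a function, the single-regex variant might look like the sketch below. The function name is hypothetical, and re.LOCALE is again omitted on the assumption of Python 3.

```python
import re

# One pattern for both quote styles; \1 requires the closing quote
# to be the same character as the opening one.
MATCHED_QUOTES = re.compile(r"""(['"])(.*)\1""")


def strip_matched_quotes(item):
    """Remove the first pair of matching quotes, keeping what is inside."""
    return MATCHED_QUOTES.sub(r"\2", item, count=1)
```

One difference from the two-pattern version is worth knowing: this strips only a single pair of quotes, so a value wrapped in both styles keeps its inner pair, whereas the original would remove one pair of each.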

Finally, you can combine this into the regex in processVariables. Taking Torsten Marek's suggestion to use a function for re.sub, this improves and simplifies things dramatically.

VariableDefinition = re.compile(r'<%(["\']?)(.*?)\1=(["\']?)(.*?)\3%>', re.LOCALE)
VarRepl = re.compile(r'<%(["\']?)(.*?)\1%>', re.LOCALE)

def processVariables(item):
    vars = {}
    def findVars(m):
        vars[m.group(2).upper()] = m.group(4)
        return ""

    item = VariableDefinition.sub(findVars, item)
    return VarRepl.sub(lambda m: vars[m.group(2).upper()], item)

print(processVariables('<%"TITLE"="This Is A Test Variable"%>The Web <%"TITLE"%>'))

Here are my timings for 100000 runs:

Original       : 13.637
Global regexes : 12.771
Single regex   :  9.095
Final version  :  1.846
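The answer does not show how these numbers were measured. One way to collect comparable timings, assuming the standard timeit module, is sketched below for the "compile per call" versus "precompiled" comparison. The helper names and the sample string are placeholders, and the absolute numbers will differ by machine and Python version.

```python
import re
import timeit

QUOTE_PATTERN = r"""(['"])(.*)\1"""
PRECOMPILED = re.compile(QUOTE_PATTERN)


def strip_compiling_each_call(item):
    # re.compile is called every time; after the first call it is a cache lookup.
    pattern = re.compile(QUOTE_PATTERN)
    return pattern.sub(r"\2", item, count=1)


def strip_precompiled(item):
    # Uses the module-level pattern, so no lookup happens per call.
    return PRECOMPILED.sub(r"\2", item, count=1)


def time_runs(func, sample='"TITLE"', runs=100000):
    """Return the total seconds taken by `runs` calls of func(sample)."""
    return timeit.timeit(lambda: func(sample), number=runs)


if __name__ == "__main__":
    for label, func in [("Compile per call", strip_compiling_each_call),
                        ("Precompiled", strip_precompiled)]:
        print("%-17s: %.3f" % (label, time_runs(func)))
```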

[Edit] Add missing non-greedy specifier

[Edit2] Added .upper() calls so case insensitive like original version

Torsten Marek · Answer

sub can take a callable as its replacement argument rather than a simple string. Using that, you can replace all variables with one function call:

>>> import re
>>> var_matcher = re.compile(r'<%(.*?)%>', re.LOCALE)
>>> string = '<%"TITLE"%> <%"SHMITLE"%>'
>>> values = {'"TITLE"': "I am a title.", '"SHMITLE"': "And I am a shmitle."}
>>> var_matcher.sub(lambda m: values[m.group(1)], string)
'I am a title. And I am a shmitle.'

Follow eduffy.myopenid.com's advice and keep the compiled regexes around.

The same recipe can be applied to the first loop, only there you need to store the value of the variable first, and always return "" as replacement.
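A minimal sketch of that two-pass idea, keeping the question's original patterns and its strip/upper-case handling of names. The helper names (_clean, store, lookup) are illustrative, and re.LOCALE is dropped on the assumption of Python 3.

```python
import re

VARIABLE_DEFINITION = re.compile(r"<%(.*?)=(.*?)%>")
VARIABLE_USE = re.compile(r"<%(.*?)%>")
MATCHED_QUOTES = re.compile(r"""(['"])(.*)\1""")


def _clean(text):
    """Trim whitespace and remove one pair of matching quotes."""
    return MATCHED_QUOTES.sub(r"\2", text.strip(), count=1)


def process_variables(item):
    variables = {}

    def store(match):
        # First pass: remember the value, replace the definition with nothing.
        name, value = match.groups()
        variables[_clean(name).upper()] = _clean(value)
        return ""

    def lookup(match):
        # Second pass: replace each use with the stored value.
        return variables[_clean(match.group(1)).upper()]

    item = VARIABLE_DEFINITION.sub(store, item)
    return VARIABLE_USE.sub(lookup, item)
```

A side benefit of a callable replacement is that its return value is inserted literally, so a variable value containing backslashes is not reinterpreted as a replacement template the way a plain string argument to sub would be.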

JesperE · Answer

Never create your own programming language. Ever. (I used to have an exception to this rule, but not any more.)

There is always an existing language you can use which suits your needs better. If you elaborated on your use-case, people may help you select a suitable language.

Dan Udey · Answer

Creating a templating language is all well and good, but shouldn't one of the goals of the templating language be easy readability and efficient parsing? The example you gave seems to be neither.

As Jamie Zawinski famously said:

Some people, when confronted with a problem, think "I know, I'll use regular expressions!" Now they have two problems.

If regular expressions are a solution to a problem you have created, the best bet is not to write a better regular expression, but to redesign your approach to eliminate their use entirely. Regular expressions are complicated, expensive, hugely difficult to maintain, and (ideally) should only be used for working around a problem someone else created.

JacquesB · Answer

You can match both kinds of quotes in one go with r"(\"|')(.*?)\1". The \1 refers back to the first group, so a quoted string only matches when its closing quote is the same character as its opening quote.
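To see the backreference at work, here is a small sketch that pulls every properly quoted string out of a piece of text. The function name and the sample text are made up for illustration.

```python
import re

# Group 1 captures the opening quote; \1 demands the same character to close.
QUOTED = re.compile(r"(\"|')(.*?)\1")


def quoted_strings(text):
    """Return the contents of each quoted span whose quotes match."""
    return [match.group(2) for match in QUOTED.finditer(text)]


sample = '''"it's" and 'say "hi"' but not "mixed' '''
print(quoted_strings(sample))
```

The span opened with a double quote and closed with a single quote is skipped, while a quote of the other kind inside a quoted string is kept as ordinary content.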

Tags: python, regex, algorithm, optimization

Schof
