I'm using Python regexes in a criminally inefficient manner

My goal here is to create a very simple template language. At the moment, I'm working on replacing a variable with a value, like this:

This input:

<%"TITLE"="This Is A Test Variable"%>The Web <%"TITLE"%>

Should produce this output:

The Web This Is A Test Variable

I've got it working. But looking at my code, I'm running multiple identical regexes on the same strings -- that just offends my sense of efficiency. There's got to be a better, more Pythonic way. (It's the two "while" loops that really offend.)

This does pass the unit tests, so if this is silly premature optimization, tell me -- I'm willing to let this go. There may be dozens of these variable definitions and uses in a document, but not hundreds. But I suspect there are obvious (to other people) ways of improving this, and I'm curious what the StackOverflow crowd will come up with.

import re


def stripMatchedQuotes(item):
    MatchedSingleQuotes = re.compile(r"'(.*)'", re.LOCALE)
    MatchedDoubleQuotes = re.compile(r'"(.*)"', re.LOCALE)
    item = MatchedSingleQuotes.sub(r'\1', item, 1)
    item = MatchedDoubleQuotes.sub(r'\1', item, 1)
    return item


def processVariables(item):
    VariableDefinition = re.compile(r'<%(.*?)=(.*?)%>', re.LOCALE)
    VariableUse = re.compile(r'<%(.*?)%>', re.LOCALE)
    Variables={}

    while VariableDefinition.search(item):
        VarName, VarDef = VariableDefinition.search(item).groups()
        VarName = stripMatchedQuotes(VarName).upper().strip()
        VarDef = stripMatchedQuotes(VarDef.strip())
        Variables[VarName] = VarDef
        item = VariableDefinition.sub('', item, 1)

    while VariableUse.search(item):
        VarName = stripMatchedQuotes(VariableUse.search(item).group(1).upper()).strip()
        item = VariableUse.sub(Variables[VarName], item, 1)

    return item

Schof asked Sep 28 '08 20:09


5 Answers

The first thing that may improve things is to move the re.compile calls outside the function. The compilation is cached, but there is still a speed hit from checking the cache on every call to see if the pattern is already compiled.
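
A minimal sketch of that first step, written for Python 3. The constant names and the snake_case function name are mine, and re.LOCALE is dropped because Python 3 refuses that flag on str patterns (none of these patterns depend on it anyway):

```python
import re

# Compiled once at import time instead of on every call.
MATCHED_SINGLE_QUOTES = re.compile(r"'(.*)'")
MATCHED_DOUBLE_QUOTES = re.compile(r'"(.*)"')


def strip_matched_quotes(item):
    """Remove one pair of single quotes, then one pair of double quotes."""
    item = MATCHED_SINGLE_QUOTES.sub(r"\1", item, count=1)
    item = MATCHED_DOUBLE_QUOTES.sub(r"\1", item, count=1)
    return item


print(strip_matched_quotes('"TITLE"'))  # TITLE
```

The behaviour is the same as the version in the question; only the place where the patterns are built has changed.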

Another possibility is to use a single regex as below:

MatchedQuotes = re.compile(r"(['\"])(.*)\1", re.LOCALE)
item = MatchedQuotes.sub(r'\2', item, 1)
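
Wrapped up as a self-contained Python 3 sketch (the names MATCHED_QUOTES and strip_matched_quotes are illustrative, and re.LOCALE is omitted since Python 3 rejects it on str patterns), that looks roughly like this:

```python
import re

# Group 1 captures the opening quote; \1 demands the same character to close.
MATCHED_QUOTES = re.compile(r"(['\"])(.*)\1")


def strip_matched_quotes(item):
    """Remove the first pair of matching quotes, single or double."""
    return MATCHED_QUOTES.sub(r"\2", item, count=1)


print(strip_matched_quotes('"TITLE"'))   # TITLE
print(strip_matched_quotes("'TITLE'"))   # TITLE
print(strip_matched_quotes("\"TITLE'"))  # unchanged, the quotes do not match
```

One difference from the two-regex version worth knowing about: this strips only one pair of quotes, so a doubly quoted value such as "'x'" comes back as 'x' rather than x.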

Finally, you can fold the quote matching into the regexes in processVariables. Combined with Torsten Marek's suggestion to pass a function to re.sub, this improves and simplifies things dramatically.

VariableDefinition = re.compile(r'<%(["\']?)(.*?)\1=(["\']?)(.*?)\3%>', re.LOCALE)
VarRepl = re.compile(r'<%(["\']?)(.*?)\1%>', re.LOCALE)

def processVariables(item):
    vars = {}
    def findVars(m):
        vars[m.group(2).upper()] = m.group(4)
        return ""

    item = VariableDefinition.sub(findVars, item)
    return VarRepl.sub(lambda m: vars[m.group(2).upper()], item)

print processVariables('<%"TITLE"="This Is A Test Variable"%>The Web <%"TITLE"%>')

Here are my timings for 100000 runs:

Original       : 13.637
Global regexes : 12.771
Single regex   :  9.095
Final version  :  1.846
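
If you want to run a comparison like this yourself, something along these lines should work on Python 3. It is a sketch, not the script behind the numbers above: the names with_loops and with_callback are mine, both variants share the quote-aware regexes, and re.LOCALE is left out because Python 3 rejects it on str patterns. Expect different absolute numbers on your machine.

```python
import re
import timeit

SAMPLE = '<%"TITLE"="This Is A Test Variable"%>The Web <%"TITLE"%>'
RUNS = 100_000

DEFINITION = re.compile(r'<%(["\']?)(.*?)\1=(["\']?)(.*?)\3%>')
USE = re.compile(r'<%(["\']?)(.*?)\1%>')


def with_loops(item):
    """Search, search again, then substitute: one match per loop pass."""
    variables = {}
    while DEFINITION.search(item):
        _, name, _, value = DEFINITION.search(item).groups()
        variables[name.upper()] = value
        item = DEFINITION.sub("", item, count=1)
    while USE.search(item):
        name = USE.search(item).group(2).upper()
        item = USE.sub(variables[name], item, count=1)
    return item


def with_callback(item):
    """One sub() call per regex, with a function as the replacement."""
    variables = {}

    def remember(m):
        variables[m.group(2).upper()] = m.group(4)
        return ""

    item = DEFINITION.sub(remember, item)
    return USE.sub(lambda m: variables[m.group(2).upper()], item)


if __name__ == "__main__":
    for fn in (with_loops, with_callback):
        seconds = timeit.timeit(lambda: fn(SAMPLE), number=RUNS)
        print(f"{fn.__name__:14s}: {seconds:.3f}")
```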

[Edit] Add missing non-greedy specifier

[Edit2] Added .upper() calls so case insensitive like original version

Brian answered Nov 16 '22 08:11


sub can take a callable as its argument rather than a simple string. Using that, you can replace all variables with one function call:

>>> import re
>>> var_matcher = re.compile(r'<%(.*?)%>', re.LOCALE)
>>> string = '<%"TITLE"%> <%"SHMITLE"%>'
>>> values = {'"TITLE"': "I am a title.", '"SHMITLE"': "And I am a shmitle."}
>>> var_matcher.sub(lambda m: values[m.group(1)], string)
'I am a title. And I am a shmitle.'

Follow eduffy.myopenid.com's advice and keep the compiled regexes around.

The same recipe can be applied to the first loop, only there you need to store the value of the variable first, and always return "" as replacement.
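
A rough Python 3 sketch of both steps together. The names definition_matcher, use_matcher, process and remember are made up for the example, keys keep their quotes as in the session above, and re.LOCALE is dropped because Python 3 does not accept it on str patterns:

```python
import re

definition_matcher = re.compile(r"<%(.*?)=(.*?)%>")
use_matcher = re.compile(r"<%(.*?)%>")


def process(text):
    values = {}

    def remember(match):
        # Store the definition first, then replace the tag with nothing.
        name, value = match.groups()
        values[name.strip()] = value.strip().strip("\"'")
        return ""

    text = definition_matcher.sub(remember, text)
    # Second pass: every remaining tag is a use, looked up in the dict.
    return use_matcher.sub(lambda m: values[m.group(1).strip()], text)


print(process('<%"TITLE"="I am a title."%>Heading: <%"TITLE"%>'))
# Heading: I am a title.
```

An undefined variable raises KeyError here, the same as the dictionary lookup in the question's code.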

Torsten Marek answered Nov 16 '22 07:11


Never create your own programming language. Ever. (I used to have an exception to this rule, but not any more.)

There is always an existing language you can use which suits your needs better. If you elaborate on your use case, people may be able to help you select a suitable one.
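
As one illustration of leaning on something that already exists (my choice for the example, not a recommendation for your specific case), Python's standard library ships string.Template, which handles $name placeholders. The placeholder name title and the sample text are made up:

```python
from string import Template

page = Template("The Web $title")

# substitute() fills in every placeholder and raises KeyError if one is missing.
print(page.substitute(title="This Is A Test Variable"))

# safe_substitute() leaves unknown placeholders in place instead of raising.
print(page.safe_substitute())
```

This does not cover defining variables inside the document itself, so whether it fits depends on the use case.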

JesperE answered Nov 16 '22 07:11


Creating a templating language is all well and good, but shouldn't one of the goals of the templating language be easy readability and efficient parsing? The example you gave seems to be neither.

As Jamie Zawinski famously said:

Some people, when confronted with a problem, think "I know, I'll use regular expressions!" Now they have two problems.

If regular expressions are a solution to a problem you have created, the best bet is not to write a better regular expression, but to redesign your approach to eliminate their use entirely. Regular expressions are complicated, expensive, hugely difficult to maintain, and (ideally) should only be used for working around a problem someone else created.
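
For what it's worth, here is a sketch of what a regex-free version of the same tag syntax could look like, using only str.find and str.partition. The function name and the single-pass design are assumptions of mine: unlike the code in the question, a variable has to be defined before it is used.

```python
def process_variables(text):
    """Expand <%NAME=VALUE%> definitions and <%NAME%> uses in one pass."""
    variables = {}
    out = []
    pos = 0
    while True:
        start = text.find("<%", pos)
        if start == -1:
            break
        end = text.find("%>", start + 2)
        if end == -1:
            break  # unterminated tag: leave the rest of the text untouched
        out.append(text[pos:start])
        name, eq, value = text[start + 2:end].partition("=")
        # strip("\"'") removes any leading/trailing quote characters.
        key = name.strip().strip("\"'").upper()
        if eq:
            variables[key] = value.strip().strip("\"'")
        else:
            out.append(variables[key])  # KeyError if never defined
        pos = end + 2
    out.append(text[pos:])
    return "".join(out)


print(process_variables('<%"TITLE"="This Is A Test Variable"%>The Web <%"TITLE"%>'))
# The Web This Is A Test Variable
```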

Dan Udey answered Nov 16 '22 06:11


You can match both kinds of quotes in one go with r"(\"|')(.*?)\1" - the \1 refers to the first group, so the closing quote has to be the same character as the opening one.
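
A small Python 3 sketch of why the non-greedy .*? matters here. The names GREEDY and LAZY and the sample text are invented for the example:

```python
import re

GREEDY = re.compile(r"(\"|')(.*)\1")
LAZY = re.compile(r"(\"|')(.*?)\1")

text = '"first" and \'second\' and "third"'

# Greedy: runs from the first double quote to the last one on the line.
print(GREEDY.search(text).group(2))
# first" and 'second' and "third

# Non-greedy: stops at the nearest matching quote, so each string is separate.
print([m.group(2) for m in LAZY.finditer(text)])
# ['first', 'second', 'third']
```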

JacquesB answered Nov 16 '22 08:11