Bear with me, I can't include my 1,000+ line program, and there are a couple of questions in the description.
So I have a couple types of patterns I am searching for:
# Literally just a regular word
re.search("Word", arg)

# Varying complex pattern
re.search("[0-9]{2,6}-[0-9]{2}-[0-9]{1}", arg)

# Words with varying cases and the possibility of ending special characters
re.search("Supplier [Aa]ddress:?|Supplier [Ii]dentification:?|Supplier [Nn]ame:?", arg)

# I also use re.findall with the patterns above
re.findall("uses patterns above", arg)
I have about 75 of these in total, and some need to be moved to deeply nested functions.
When and where should I compile the patterns?
Right now I am trying to improve my program by compiling everything in main and then passing the correct list of compiled RegexObjects to the function that uses it. Would this increase my performance?
Would doing something like the following increase the speed of my program?
re.compile("pattern").search(arg)
Do the compiled patterns stay in memory, so that if a function containing one is called multiple times, the compiling step is skipped? Then I wouldn't have to move data from function to function.
Is it even worth compiling all of the patterns if I move the data so much?
Is there a better way to match regular words without regex?
Short example of my code:
import re

def foo(arg, allWords):
    # Does some things with arg, then puts the result into a variable;
    # this function does not use allWords
    data = arg  # This is the manipulated version of arg
    return bar(data, allWords)

def bar(data, allWords):
    if allWords[0].search(data) is not None:
        temp = data.split("word1", 1)[1]
        return temp
    elif allWords[1].search(data) is not None:
        temp = data.split("word2", 1)[1]
        return temp

def main():
    allWords = [re.compile(m) for m in ["word1", "word2", "word3"]]
    arg = "This is a very long string from a text document input, the provided patterns might not be word1 in this string but I need to check for them, and if they are there do some cool things word3"
    # This loop runs a couple million times
    # because it loops through a couple million text documents
    while True:
        data = foo(arg, allWords)
This is a tricky subject: many answers, even some legitimate sources such as David Beazley's Python Cookbook, will tell you something like:
[Use compile()] when you're going to perform a lot of matches using the same pattern. This lets you compile the regex only once versus at each match. [see p. 45 of that book]
However, that really hasn't been true since sometime around Python 2.5. Here's a note straight out of the re docs:

Note: The compiled versions of the most recent patterns passed to re.compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn't worry about compiling regular expressions.
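You can see that cache at work in CPython; note that re._cache is an implementation detail, not a public API, so treat this as an illustrative sketch only:

import re

re.purge()                   # empty the internal pattern cache
re.search(r"\d+", "abc123")  # module-level call compiles and caches the pattern
re.search(r"\d+", "def456")  # same pattern: the cached compilation is reused
print(len(re._cache))        # 1 -- a single compiled pattern was stored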
There are two small arguments against this (the cache holds only a limited number of patterns, 512 in current CPython, and each module-level call pays a small cache-lookup overhead), but (anecdotally speaking) these won't result in noticeable timing differences the majority of the time:
Here's a rudimentary test of the above using the 20 newsgroups text dataset. On a relative basis, the improvement in speed is about 1.6% with compiling, presumably due mostly to cache lookup.
import re
from sklearn.datasets import fetch_20newsgroups

# A list of length ~20,000, paragraphs of text
news = fetch_20newsgroups(subset='all', random_state=444).data

# The tokenizer used by most text-processing vectorizers such as TF-IDF
regex = r'(?u)\b\w\w+\b'
regex_comp = re.compile(regex)

def no_compile():
    for text in news:
        re.findall(regex, text)

def with_compile():
    for text in news:
        regex_comp.findall(text)
%timeit -r 3 -n 5 no_compile()
1.78 s ± 16.2 ms per loop (mean ± std. dev. of 3 runs, 5 loops each)
%timeit -r 3 -n 5 with_compile()
1.75 s ± 12.2 ms per loop (mean ± std. dev. of 3 runs, 5 loops each)
That really only leaves one very defensible reason to use re.compile():

By precompiling all expressions when the module is loaded, the compilation work is shifted to application start time, instead of to a point when the program may be responding to a user action. [source; p. 15]

It's not uncommon to see constants declared at the top of a module with compile. For example, in smtplib you'll find OLDSTYLE_AUTH = re.compile(r"auth=(.*)", re.I).
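Applied to the question's patterns, a minimal sketch of that module-level style (the module and function names here are made up for illustration):

# patterns.py -- hypothetical module; everything compiles once, at import time
import re

SUPPLIER_FIELD = re.compile(r"Supplier [Aa]ddress:?|Supplier [Ii]dentification:?|Supplier [Nn]ame:?")
ID_NUMBER = re.compile(r"[0-9]{2,6}-[0-9]{2}-[0-9]{1}")

def find_supplier_field(line):
    # Any function in this module, however deeply nested, can use the
    # constants directly, so no pattern list has to be passed around.
    m = SUPPLIER_FIELD.search(line)
    return m.group(0) if m else None

This also speaks to the "move data from function to function" concern: module-level constants are visible to every function in the module.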
Note that compiling happens (eventually) whether or not you use re.compile(). When you do use compile(), you're compiling the passed regex at that moment. If you use the module-level functions like re.search(), you're compiling and searching in this one call. The two processes below are equivalent in this regard:
# with re.compile - gets you a compiled pattern object
# and then you call its method, .search()
a = re.compile('regex[es|p]')  # compiling happens now
a.search('regexp')             # searching happens now

# with module-level function
re.search('regex[es|p]', 'regexp')  # compiling and searching both happen here
Lastly you asked,
Is there a better way to match regular words without regex?
Yes; this is mentioned as a "common problem" in the HOWTO:
Sometimes using the re module is a mistake. If you’re matching a fixed string, or a single character class, and you’re not using any re features such as the IGNORECASE flag, then the full power of regular expressions may not be required. Strings have several methods for performing operations with fixed strings and they’re usually much faster, because the implementation is a single small C loop that’s been optimized for the purpose, instead of the large, more generalized regular expression engine. [emphasis added]
...
In short, before turning to the re module, consider whether your problem can be solved with a faster and simpler string method.
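For the fixed-string cases in the question, a short sketch of what those string methods look like (the sample text is made up):

text = "Supplier Name: Acme Corp"

print("Supplier" in text)           # True  -- fastest way to test for a fixed substring
print(text.find("Name"))            # 9     -- index of first occurrence, or -1 if absent
print(text.startswith("Supplier"))  # True  -- anchored check, no regex needed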
Let's say that word1, word2 ... are regexes. Let's rewrite those parts:

allWords = [re.compile(m) for m in ["word1", "word2", "word3"]]

I would create one single regex for all patterns:

allWords = re.compile("|".join(["word1", "word2", "word3"]))
To support regexes with | in them, you would have to parenthesize the expressions:

allWords = re.compile("|".join("({})".format(x) for x in ["word1", "word2", "word3"]))
(That also works with standard words of course, and it's still worth using regexes because of the | part.)
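A quick check of the combined pattern on a made-up input:

import re

allWords = re.compile("|".join(["word1", "word2", "word3"]))
m = allWords.search("some text containing word2 somewhere")
print(m.group(0))  # 'word2' -- the alternative that matched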
Now, this is a disguised loop with each term hardcoded:

def bar(data, allWords):
    if allWords[0].search(data):
        temp = data.split("word1", 1)[1]  # that works only on non-regexes BTW
        return temp
    elif allWords[1].search(data):
        temp = data.split("word2", 1)[1]
        return temp
It can be rewritten simply as:

def bar(data, allWords):
    return allWords.split(data, maxsplit=1)[1]
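One behavioral difference worth flagging: if nothing matches, split returns the whole string as a one-element list, so indexing [1] raises an IndexError, where the original bar returned None. A small defensive variant (a sketch, reusing the combined allWords pattern from above):

import re

allWords = re.compile("|".join(["word1", "word2", "word3"]))

def bar(data, allWords):
    parts = allWords.split(data, maxsplit=1)
    return parts[1] if len(parts) > 1 else None  # None on no match, like the original

print(bar("prefix word2 suffix", allWords))  # ' suffix'
print(bar("no match here", allWords))        # None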
In terms of performance: the last hiccup is that internally the regex engine searches for all expressions in a loop, which makes this an O(n) algorithm. To make it faster, you would have to predict which pattern is the most frequent and put it first (my hypothesis is that the regexes are "disjoint", meaning a given text cannot be matched by several of them; otherwise the longest would have to come before the shorter one).
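As a small illustration of why ordering matters in an alternation: with |, the leftmost alternative that matches at a given position wins, so a shorter pattern listed first can shadow a longer one.

import re

shorter_first = re.compile("word|wordy")
longer_first = re.compile("wordy|word")

print(shorter_first.search("wordy").group(0))  # 'word'  -- 'wordy' is never tried
print(longer_first.search("wordy").group(0))   # 'wordy' -- longest listed first wins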