I am currently using Python's re module to search and capture groups.
I have a list of regular expressions which I have to compile and match against a large dataset, which causes performance issues.
Example:
REGEXES = [
    r'^New York(?P<grp1>\d+/\d+): (?P<grp2>.+)$',
    r'^Ohio (?P<grp1>\d+/\d+/\d+): (?P<grp2>.+)$',
    r'(?P<year>\d{4}-\d{1,2}-\d{1,2})$',
    r'^(?P<year>\d{1,2}/\d{1,2}/\d{2,4})$',
    r'^(?P<title>.+?)[- ]+E(?P<epi>\d+)$',
    # ...
]
Note: Regexes won't be similar
COMPILED_REGEXES = [re.compile(r, flags=re.I) for r in REGEXES]
def find_match(string):
    for regex in COMPILED_REGEXES:
        match = regex.search(string)
        if not match:
            continue
        return match
Is there a way around this? The idea is to avoid iterating through the compiled regexes one by one to find a match.
Algorithm for the conversion of a regular expression to an NFA: ε-closure(s) is the set of states that can be reached from state s on ε-transitions alone. If s, t, u are states, then initially ε-closure(s) = {s}; if s → t on ε, then ε-closure(s) = {s, t}.
The NFA has a single transition from the initial state to the accepting state, and this transition has the regular expression R associated with it. Since the initial state and the accepting state have no self-loops, N accepts exactly the words that match the regular expression R, i.e. L(N) = L(R).
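As a minimal sketch (mine, not part of the original answer), the ε-closure above can be computed with a worklist over a hypothetical transition table, where the symbol None stands for an ε-move:

def epsilon_closure(states, transitions):
    # `transitions` maps (state, symbol) -> set of successor states; None is the epsilon symbol
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in transitions.get((s, None), set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

# s --eps--> t --eps--> u, so the closure of {'s'} is {'s', 't', 'u'}
transitions = {('s', None): {'t'}, ('t', None): {'u'}}
print(epsilon_closure({'s'}, transitions))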
Do any of your regexps break DFA compatibility? It doesn't look like it from your examples. You can use a Python wrapper around a C/C++ DFA implementation like re2, which is a drop-in replacement for re. re2 will also fall back to using re if the regex is incompatible with re2 syntax, so it optimizes every case it can and does not fail on the incompatible ones.
Note that re2 does support the (?P<name>regex) capture syntax, but it doesn't support the (?P=name) backreference syntax.
try:
    import re2 as re
    re.set_fallback_notification(re.FALLBACK_WARNING)
except ImportError:
    # re2 not installed (the latest release targeted Python 2.6); fall back to the stdlib
    import re
If you have regexps with backrefs, you can still use re2 with a few special considerations: you'll need to replace the backrefs in your regexps with .*?, and you may get false matches, which you can then filter out with re. In real-world data, the false matches will probably be uncommon.
Here is an illustrative example:
import re
try:
    import re2
    re2.set_fallback_notification(re2.FALLBACK_WARNING)
except ImportError:
    # latest release targeted Python 2.6; this example needs re2, so bail out
    raise

REGEXES = [
    r'^New York(?P<grp1>\d+/\d+): (?P<grp2>.+)$',
    r'^Ohio (?P<grp1>\d+/\d+/\d+): (?P<grp2>.+)$',
    r'(?P<year>\d{4}-\d{1,2}-\d{1,2})$',
    r'^(?P<year>\d{1,2}/\d{1,2}/\d{2,4})$',
    r'^(?P<title>.+?)[- ]+E(?P<epi>\d+)$',
]

COMPILED_REGEXES = [re.compile(r, flags=re.I) for r in REGEXES]

# replace all backrefs (\1, \2, ... or (?P=name)) with .*? for re2 compatibility
# is there other unsupported syntax in REGEXES?
COMPILED_REGEXES_DFA = [
    re2.compile(re2.sub(r'\\[1-9][0-9]*|\(\?P=\w+\)', '.*?', r), flags=re2.I)
    for r in REGEXES
]

def find_match(string):
    for regex, regex_dfa in zip(COMPILED_REGEXES, COMPILED_REGEXES_DFA):
        match_dfa = regex_dfa.search(string)
        if not match_dfa:
            continue
        match = regex.search(string)
        # most likely branch comes first for better branch prediction
        if match:
            return match
If this isn't fast enough, you can employ a variety of techniques to feed the DFA hits to re
as they are processed, instead of storing them in a file or in memory and handing them off once they're all collected.
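For instance, here is a rough sketch (my own, assuming the COMPILED_REGEXES and COMPILED_REGEXES_DFA lists defined above) that streams the DFA pre-filter hits straight into the slower re checks with a generator, so nothing is buffered in between:

def iter_matches(lines):
    # lazily yield confirmed matches; nothing is collected in memory or written to disk
    for line in lines:
        for regex, regex_dfa in zip(COMPILED_REGEXES, COMPILED_REGEXES_DFA):
            if regex_dfa.search(line):        # cheap DFA pre-filter
                match = regex.search(line)    # exact check with the original regex
                if match:
                    yield line, match
                    break                     # first confirmed match wins, as in find_match

# usage:
# with open('bigfile.txt') as f:
#     for line, match in iter_matches(f):
#         ...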
You can also combine all your regexps into one big DFA regexp of alternating groups (r1)|(r2)|(r3)| ... |(rN)
and iterate through your group matches on the resulting match object to try to match only the corresponding original regexps. The match result object will have the same state as with OP's original solution.
# rename the group names in the regexps to avoid name collisions
REGEXES_PREFIXED = [re2.sub(r'\(\?P<(\w+)>', r'(?P<re{}_\1>'.format(idx), r) for idx, r in enumerate(REGEXES)]

# wrap and fold the regexps into (?P<hit0>pattern)| ... |(?P<hitN>pattern)
REGEX_BIG = '|'.join('(?P<hit{}>{})'.format(idx, r) for idx, r in enumerate(REGEXES_PREFIXED))

regex_dfa_big = re2.compile(REGEX_BIG, flags=re2.I)

def find_match(string):
    match_dfa = regex_dfa_big.search(string)
    if match_dfa:
        # only interested in the hit# groups that actually matched
        hits = [n for n, v in match_dfa.groupdict().items() if v is not None and re2.match(r'hit\d+', n)]
        # check for false positives against the original regexps
        for idx in (int(h.replace('hit', '')) for h in hits):
            match = COMPILED_REGEXES[idx].search(string)
            if match:
                return match
You can also look at pyre, which is a better-maintained wrapper for the same C++ library, though not a drop-in replacement for re. There's also a Python wrapper for RuRe, which is the fastest regex engine I know of.
To elaborate on my comment: the problem with putting everything into one big regexp is that group names must be unique. However, you can process your regexes as follows:
import re

REGEXES = [
    r'^New York(?P<grp1>\d+/\d+): (?P<grp2>.+)$',
    r'^Ohio (?P<grp1>\d+/\d+/\d+): (?P<grp2>.+)$',
    r'(?P<year>\d{4}-\d{1,2}-\d{1,2})$',
    r'^(?P<year>\d{1,2}/\d{1,2}/\d{2,4})$',
    r'^(?P<title>.+?)[- ]+E(?P<epi>\d+)$']

# Find the names of the groups in each regexp
groupnames = {'RE_%s' % i: re.findall(r'\(\?P<([^>]+)>', r) for i, r in enumerate(REGEXES)}

# Convert the named groups into unnamed ones
re_list_cleaned = [re.sub(r'\?P<([^>]+)>', '', r) for r in REGEXES]

# Wrap each regexp in a named group
token_re_list = ['(?P<RE_%s>%s)' % (i, r) for i, r in enumerate(re_list_cleaned)]

# Put them all together
mighty_re = re.compile('|'.join(token_re_list), re.MULTILINE)

# Use the regexp to process a big file
with open('bigfile.txt') as f:
    txt = f.read()
    for match in mighty_re.finditer(txt):
        # Find out which regexp made the match and put the matched data in a dictionary
        re_name = match.lastgroup
        groups = [g for g in match.groups() if g is not None]
        gn = groupnames[re_name]
        matchdict = dict(zip(gn, groups[1:]))
        print('Found:', re_name, matchdict)
I suggest the following steps:
Create a CSV file called Patterns.csv with two columns, pattern and name, where pattern is the regex (e.g. ^New York(?P<grp1>\d+/\d+): (?P<grp2>.+)$) and name is a label such as New York. This keeps all the regexes in a separate resource outside your code, which makes it easier to add, remove, or modify them later; a small sample is sketched below.
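For example (illustrative values only), Patterns.csv could look roughly like this:

pattern,name
"^New York(?P<grp1>\d+/\d+): (?P<grp2>.+)$",New York
"^Ohio (?P<grp1>\d+/\d+/\d+): (?P<grp2>.+)$",Ohio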
Read that CSV with pandas:
import pandas as pd
df = pd.read_csv("Patterns.csv")
Parse the CSV into a dict mapping names to patterns:
pattern = df['pattern'].tolist()
pattern_name = df['name'].tolist()
pattern_dict = dict(zip(pattern_name, pattern))
Run each pattern over your text and collect whatever it matches:
import re
import collections

sep = " ;; "
NLU_Dict = collections.defaultdict()
# `text` is the string you want to scan
for pn, p in pattern_dict.items():
    # re.findall yields tuples of groups, or plain strings when a pattern has only one group
    found = [m if isinstance(m, tuple) else (m,) for m in re.findall(p, text, re.I)]
    val = sep.join(sep.join(x for x in map(str, m) if x.strip()) for m in found)
    NLU_Dict[pn] = val
Your NLU_Dict will be a dict keyed by pattern name: each value holds the matched group values joined by ;;, and stays blank for the patterns that did not match.
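As a rough, hypothetical illustration with the patterns above:

# assuming pattern_dict contains e.g. {'Ohio': r'^Ohio (?P<grp1>\d+/\d+/\d+): (?P<grp2>.+)$', ...}
text = "Ohio 12/31/2020: heavy snow expected"
# after running the loop above, NLU_Dict would look roughly like
# {'Ohio': '12/31/2020 ;; heavy snow expected', 'New York': '', ...}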