
Python Regular Expressions to NFA

I am currently using Python's re module to search and capture groups. I have a list of regular expressions that I have to compile and match against a large dataset, which causes performance issues.

Example:

import re

REGEXES = [
    r'^New York(?P<grp1>\d+/\d+): (?P<grp2>.+)$',
    r'^Ohio (?P<grp1>\d+/\d+/\d+): (?P<grp2>.+)$',
    r'(?P<year>\d{4}-\d{1,2}-\d{1,2})$',
    r'^(?P<year>\d{1,2}/\d{1,2}/\d{2,4})$',
    r'^(?P<title>.+?)[- ]+E(?P<epi>\d+)$',
    # ...
]

Note: the regexes won't be similar to one another.

COMPILED_REGEXES = [re.compile(r, flags=re.I) for r in REGEXES]

def find_match(string):
    for regex in COMPILED_REGEXES:
        match = regex.search(string)
        if not match:
            continue
        return match

Is there a way around this? The idea is to avoid iterating through all the compiled regexes to find a match.

Abhi asked Oct 11 '18



3 Answers

Do any of your regexps break DFA compatibility? It doesn't look like it from your examples. You can use a Python wrapper around a C/C++ DFA implementation like re2, which is a drop-in replacement for re. re2 will also fall back to re if a regex is incompatible with the re2 syntax, so it will optimize all the cases it can and not fail on the incompatible ones.

Note that re2 does support the (?P<name>regex) capture syntax, but it doesn't support the (?P=name) backreference syntax (illustrated after the snippet below).

try:
    import re2 as re
    re.set_fallback_notification(re.FALLBACK_WARNING)
except ImportError:
    # fall back to re (re2's latest version was for Python 2.6)
    import re
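To make the syntax note concrete, here is a small, hypothetical illustration; the backref pattern below is the kind of regex re2 cannot handle and that triggers the fallback:

import re

named_capture = r'^(?P<word>\w+)$'  # (?P<name>...) works under both re and re2
named_backref = r'^(?P<word>\w+) (?P=word)$'  # (?P=name) is re-only

# matches a doubled word, e.g. 'hello hello'
print(re.search(named_backref, 'hello hello').group('word'))  # -> 'hello'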

If you have regexps with backrefs, you can still use re2 with a few special considerations: you'll need to replace the backrefs in your regexps with .*?, and you may get false matches, which you can then filter out with re. In real-world data, the false matches will probably be uncommon.

Here is an illustrative example:

import re
try:
    import re2
    re2.set_fallback_notification(re2.FALLBACK_WARNING)
except ImportError:
    # re2's latest version was for Python 2.6; it's required below, so re-raise
    raise

REGEXES = [
    r'^New York(?P<grp1>\d+/\d+): (?P<grp2>.+)$',
    r'^Ohio (?P<grp1>\d+/\d+/\d+): (?P<grp2>.+)$',
    r'(?P<year>\d{4}-\d{1,2}-\d{1,2})$',
    r'^(?P<year>\d{1,2}/\d{1,2}/\d{2,4})$',
    r'^(?P<title>.+?)[- ]+E(?P<epi>\d+)$',
]

COMPILED_REGEXES = [re.compile(r, flags=re.I) for r in REGEXES]
# replace numeric (\1) and named ((?P=name)) backrefs with .*? for re2 compatibility
# (is there other unsupported syntax in REGEXES?)
COMPILED_REGEXES_DFA = [
    re2.compile(re.sub(r'\\\d+|\(\?P=\w+\)', '.*?', r), flags=re2.I)
    for r in REGEXES
]

def find_match(string):
    for regex, regex_dfa in zip(COMPILED_REGEXES, COMPILED_REGEXES_DFA):
        match_dfa = regex_dfa.search(string)
        if not match_dfa:
            continue
        match = regex.search(string)
        # most likely branch comes first for better branch prediction
        if match:
            return match

If this isn't fast enough, you can employ a variety of techniques to feed the DFA hits to re as they are processed, instead of storing them in a file or in memory and handing them off once they're all collected.
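For example, here is a minimal sketch of one such technique (assuming the COMPILED_REGEXES and COMPILED_REGEXES_DFA lists defined above): a generator streams each DFA hit straight into its re verification, so nothing is buffered along the way.

def iter_matches(lines):
    # lazily yield verified matches, one input line at a time
    for line in lines:
        for regex, regex_dfa in zip(COMPILED_REGEXES, COMPILED_REGEXES_DFA):
            if regex_dfa.search(line):
                # verify the DFA hit with re to rule out a false match
                match = regex.search(line)
                if match:
                    yield match
                    break  # first verified hit wins; move on to the next line

# usage: the file is consumed lazily, line by line
# with open('bigfile.txt') as f:
#     for m in iter_matches(f):
#         print(m.groupdict())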

You can also combine all your regexps into one big DFA regexp of alternated groups (r1)|(r2)|(r3)| ... |(rN) and then iterate over the group matches on the resulting match object, trying only the corresponding original regexps. The returned match object will carry the same state as in the OP's original solution.

# rename the group names in the regexps to avoid name collisions
REGEXES_PREFIXED = [re2.sub(r'\(\?P<(\w+)>', r'(?P<re{}_\1>'.format(idx), r)
                    for idx, r in enumerate(REGEXES)]
# wrap and fold the regexps into (?P<hit0>pattern)| ... |(?P<hitN>pattern)
REGEX_BIG = '|'.join('(?P<hit{}>{})'.format(idx, r)
                     for idx, r in enumerate(REGEXES_PREFIXED))
regex_dfa_big = re2.compile(REGEX_BIG, flags=re2.I)

def find_match(string):
    match_dfa = regex_dfa_big.search(string)
    if match_dfa:
        # only interested in the hit# groups that actually matched
        hits = [n for n, v in match_dfa.groupdict().items()
                if v is not None and re2.match(r'hit\d+', n)]
        # check for false positives against the original regexes
        for idx in [int(h.replace('hit', '')) for h in hits]:
            match = COMPILED_REGEXES[idx].search(string)
            if match:
                return match
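A hypothetical usage of this find_match, with one of the sample strings; since the returned match comes from the original regex, the original group names are intact:

match = find_match('New York12/31: Some headline')
if match:
    print(match.groupdict())  # {'grp1': '12/31', 'grp2': 'Some headline'}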

You can also look at pyre, which is a better-maintained wrapper for the same C++ library, though it is not a drop-in replacement for re. There is also a Python wrapper for RuRe, which is the fastest regex engine I know of.

okovko answered Oct 10 '22


To elaborate on my comment: the problem with putting everything in one big regexp is that group names must be unique. However, you can process your regexes as follows:

import re

REGEXES = [
    r'^New York(?P<grp1>\d+/\d+): (?P<grp2>.+)$',
    r'^Ohio (?P<grp1>\d+/\d+/\d+): (?P<grp2>.+)$',
    r'(?P<year>\d{4}-\d{1,2}-\d{1,2})$',
    r'^(?P<year>\d{1,2}/\d{1,2}/\d{2,4})$',
    r'^(?P<title>.+?)[- ]+E(?P<epi>\d+)$']

# Find the names of groups in the regexps
groupnames = {'RE_%s'%i:re.findall(r'\(\?P<([^>]+)>', r) for i, r in enumerate(REGEXES)}

# Convert the named groups into unnamed ones
re_list_cleaned = [re.sub(r'\?P<([^>]+)>', '', r) for r in REGEXES]

# Wrap each regexp in a named group
token_re_list = ['(?P<RE_%s>%s)'%(i, r) for i, r in enumerate(re_list_cleaned)]

# Put them all together
mighty_re = re.compile('|'.join(token_re_list), re.MULTILINE)

# Use the regexp to process a big file
with open('bigfile.txt') as f:
    txt = f.read()
for match in mighty_re.finditer(txt):
    # Now find out which regexp made the match and put the matched data in a dictionary
    re_name = match.lastgroup
    groups = [g for g in match.groups() if g is not None]
    gn = groupnames[re_name]
    matchdict = dict(zip(gn, groups[1:]))
    print('Found:', re_name, matchdict)
EvertW answered Oct 10 '22


I suggest the following steps:

  1. Create a CSV file called Patterns.csv with two columns, name and pattern, where pattern is the regex (e.g. ^New York(?P<grp1>\d+/\d+): (?P<grp2>.+)$) and name is a label such as New York. Keeping all the regexes in a resource separate from your code makes it easier to add, remove, or modify them in the future.

  2. Read that CSV using pandas:

    import pandas as pd
    df = pd.read_csv("Patterns.csv")

  3. Parse the CSV into a dictionary mapping names to patterns:

    pattern = df['pattern'].tolist()
    pattern_name = df['name'].tolist()
    pattern_dict = dict(zip(pattern_name, pattern))

  4. Run every pattern against the text and collect the values that match:

    import collections

    sep = " ;; "
    NLU_Dict = collections.defaultdict()
    for pn, p in pattern_dict.items():
        val = sep.join([sep.join(filter(lambda x: len(str(x).strip()) > 0, map(str, v)))
                        for v in re.findall(p, text, re.I)])
        NLU_Dict[pn] = val

NLU_Dict will be a dictionary keyed by pattern name: each value is the ;;-separated string of matched groups for that pattern, or blank if the pattern did not match.
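To make the result concrete, here is a small, hypothetical run with pattern_dict built inline instead of read from Patterns.csv:

import collections
import re

pattern_dict = {'New York': r'^New York(?P<grp1>\d+/\d+): (?P<grp2>.+)$'}
text = 'New York12/31: Some headline'

sep = " ;; "
NLU_Dict = collections.defaultdict()
for pn, p in pattern_dict.items():
    val = sep.join([sep.join(filter(lambda x: len(str(x).strip()) > 0, map(str, v)))
                    for v in re.findall(p, text, re.I)])
    NLU_Dict[pn] = val

print(dict(NLU_Dict))  # {'New York': '12/31 ;; Some headline'}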

Rahul Agarwal answered Oct 10 '22