Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Partial Regex Matching

Tags:

python

regex

I am inquiring about partial regex matching in Python.

For Example:

If you have a string:

string = 'foo bar cat dog elephant barn yarn p n a'

And a regex:

pattern = r'foo bar cat barn yard p n a f'

The following would be true:

  • re.match(pattern, string) Would return None.
  • re.search(pattern, string) Would also return None

Although we can all see that the first part of the pattern matches the first part of the string.

So instead of searching for the entirety of the pattern in the string, is there a way to see what percentage of the string matched the pattern?

like image 904
Jason Avatar asked Dec 25 '22 07:12

Jason


2 Answers

yes, it is possible to do partial regular expression matching

I was playing around with this idea of partial matches and found this Q during my searches. I found a way to do what I needed and thought I'd post it here.

It's no speed demon. Probably only useful where speed isn't an issue.

This function finds the best partial match for the regular expression and returns the matching text.

>>> def partial_match(regex, string, flags=0, op=re.match):
...     """
...     Matches a regular expression to a string incrementally, retaining the
...     best substring matched.
...     :param regex:   The regular expression to apply to the string.
...     :param string:  The target string.
...     :param flags:   re module flags, e.g.: `re.I`
...     :param op:      Either of re.match (default) or re.search.
...     :return:        The substring of the best partial match.
...     """
...     m = op(regex, string, flags)
...     if m:
...         return m.group(0)
...     final = None
...     for i in range(1, len(regex) + 1):
...         try:
...             m = op(regex[:i], string, flags)
...             if m:
...                 final = m.group(0)
...         except re.error:
...             pass
...     return final
...     

Testing it out:

>>> partial_match(r".*l.*?iardvark", "bluebird")
'bluebi'
>>> 
>>> partial_match(r"l.*?iardvark", "bluebird")
>>> # None was returned. Try again with search...
>>> 
>>> partial_match(r"l.*?iardvark", "bluebird", op=re.search)
'luebi'
>>>
>>> string = 'foo bar cat dog elephant barn yarn p n a'
>>> pattern = r'foo bar cat barn yard p n a f'
>>> 
>>> partial_match(pattern, string)
'foo bar cat '
>>> 
>>> partial_match(r".* (zoo){1,3}ran away", "the fox at the "
...                                         "zoozoozoozoozoo is "
...                                         "happy")
'the fox at the zoozoozoo'

Behaves as expected. The algorithm keeps trying to match as much of the expression it can to the target string. It continues until the whole expression has been matched up against the target string, retaining the best partial match.

Alright. Now let's see just how slow it really is...

>>> import cProfile as cprof, random as rand, re
>>>
>>> # targets = ['lazy: that# fox~ The; little@ quick! lamb^ dog~ ',
>>> #            << 999 more random strings of random length >>]
>>>
>>> words = """The; quick! brown? fox~ jumped, over. the! lazy: dog~
...            Mary? had. a little- lamb, a& little@ lamb^ {was} she... and,,, 
...            [everywhere] that# Mary* went=, the. "lamb" was; sure() (to) be.
...         """.split()
...
>>> targets = [' '.join(rand.choices(words, k=rand.randint(1, 100))) 
...            for _ in range(1000)]
...
>>> exprs   = ['.*?&', '.*(jumped|and|;)', '.{1,100}[\\.,;&#^]', '.*?!', 
...            '.*?dog. .?lamb.?', '.*?@', 'lamb', 'Mary']
...
>>> partial_match_script = """
... for t in targets:
...     for e in exprs:
...         m = partial_match(e, t)
...         """
...
>>> match_script = """
... for t in targets:
...     for e in exprs:
...         m = re.match(e, t)
...         """
... 
>>> cprof.run(match_script)
         32003 function calls in 0.032 seconds
>>>
>>> cprof.run(partial_match_script)
         261949 function calls (258167 primitive calls) in 0.230 seconds

Running it head-to-head with re.match() which only has to do regular matches and fails most of the time probably isn't a fair comparison of the function's performance. A better comparison would be against a module that supports fuzzy matching, which I did below.. and the function did okay afterall.

A more performant solution could possibly be developed using the re.sre_parse and/or re.sre_compile modules. It looks like all the documentation to go by is in the sources and bits and pieces on the 'net like https://www.programcreek.com/python/example/1434/sre_parse

With those modules, I think there may be a way to apply a regex incrementally by tokens, or by subexpressions, rather than by single characters as I've done.

Also, as someone commented, the regex package has fuzzy matching capabilities (https://pypi.org/project/regex/) - but it behaves a little differently and could permit unexpected characters in the partial match.

>>> import regex
>>>
>>> regex.match(r"(?:a.b.c.d){d}", "a.b.c", regex.ENHANCEMATCH).group(0)
'a.b.c'
>>> regex.match(r"(?:moo ow dog cat){d}", "moo cow house car").group(0)
'moo c'
>>> regex.match(r"(?:moo ow dog cat){d}", "moo cow house car", 
...             regex.ENHANCEMATCH).group(0)
...
'moo c'
>>> # ^^ the 'c' above is not what we want in the output. As you can see,
>>> # the 'fuzzy' matching is a bit different from partial matching.
>>>
>>> regex_script = """
... for t in targets:
...     for e in exprs:
...         m = regex.match(rf"(?:{e}){{d}}", t)
...         """
>>>
>>> cprof.run(regex_script)
         57912 function calls (57835 primitive calls) in 0.180 seconds
...
>>> regex_script = """
... for t in targets:
...     for e in exprs:
...         m = regex.match(rf"(?:{e}){{d}}", t, flags=regex.ENHANCEMATCH)
...         """
>>> 
>>> cprof.run(regex_script)
         57904 function calls (57827 primitive calls) in 0.298 seconds

Performance is a little better than the partial_match() solution without the regex.ENHANCEMATCH flag. However it's slower with the flag.

regex with the regex.BESTMATCH flag is probably most similar to partial_match() in behavior, but it is even slower:

>>> regex_script = """
... for t in targets:
...     for e in exprs:
...         m = regex.match(rf"(?:{e}){{d}}", t, flags=regex.BESTMATCH)
...         """
>>> cprof.run(regex_script)
         57912 function calls (57835 primitive calls) in 0.338 seconds

regex also has a partial=True flag, but this doesn't seem to work at all like we'd expect.

like image 67
Todd Avatar answered Jan 05 '23 06:01

Todd


Not with regular expressions.

from difflib import SequenceMatcher
SequenceMatcher(None, string, pattern).ratio()
# => 0.7536231884057971

You can even match words rather than characters:

SequenceMatcher(None, string.split(), pattern.split()).ratio()
# => 0.7368421052631579
like image 42
Amadan Avatar answered Jan 05 '23 08:01

Amadan