
Incrementally finding regular expression matches in streaming data in Python

I have data streaming into a number of TCP sockets continuously. For each, I have a different regular expression that I need to pull out matches for. For example, one might match numbers of the format ##.# followed by the letter f:

r = re.compile(rb'([0-9][0-9]\.[0-9])f')

Another might match numbers of the format ### preceded by the letter Q:

r = re.compile(rb'Q([0-9][0-9][0-9])')

In reality, the expressions may be of arbitrary length and complexity; they are pulled from configuration files rather than hard-coded, so they are not known in advance.

When new data comes in, I append it to a buffer of type bytearray() (here called self.buffer). Then I call a function like this (with self.r being the compiled regular expression):

def advance(self):
    m = self.r.search(self.buffer)

    # No match. Return.
    if m is None:
        return None

    # Match. Advance the buffer and return the matched groups.
    self.buffer = self.buffer[m.end():]
    return m.groups()

If there is no match yet, it returns None. If there is a match, it returns the match and discards the buffer up to the end of the match, making itself ready to be called again.
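For context, here is a minimal self-contained version of the class this method lives in (the class name and the feed method are my own framing, not from the original):

```python
import re

class Matcher:
    """Accumulates streamed bytes and yields regex matches as they complete."""

    def __init__(self, pattern):
        self.r = re.compile(pattern)
        self.buffer = bytearray()

    def feed(self, data):
        # Called whenever new data arrives on the socket.
        self.buffer += data

    def advance(self):
        m = self.r.search(self.buffer)

        # No match. Return.
        if m is None:
            return None

        # Match. Advance the buffer and return the matched groups.
        self.buffer = self.buffer[m.end():]
        return m.groups()

m = Matcher(rb'([0-9][0-9]\.[0-9])f')
m.feed(b'xx12.')
assert m.advance() is None        # incomplete: '12.' might still become '12.3f'
m.feed(b'3f yy')
assert m.advance() == (b'12.3',)  # complete match found, buffer advanced
```

Note that the buffer can never be truncated before a match is found, which is exactly the inefficiency described below.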

However, this method isn't particularly efficient. The problem is that the buffer has to be scanned from the beginning over and over again--whenever new data comes in--until a match is found. This could be thousands of times and millions of characters scanned repeatedly, before a match is found and the start of the buffer can be advanced.

I can't simply discard the contents of the buffer before a match is found because a match might need the last few bytes of the buffer (or even the entire buffer). Once more data comes in, the end of the buffer could be the start of a match.

How can I rewrite my "advance" function to safely discard portions of the buffer that could never contain the start of the regular expression so the entire buffer needn't be scanned repeatedly?

One possibility: Is there an alternative to "search" that returns something other than "None" if the reason the match wasn't found was because the end of the string was reached? And, if so, can I get the starting position of the potential match? That would allow me to discard the buffer up to that point.

Another possibility: Some type of library that is smart enough to rewrite arbitrary regular expressions so they can match in a different--and detectable--way on truncated strings.

I would entertain other possibilities as well, but they do need to work with arbitrary regular expressions and not just the simple ones above. Ideally, they also wouldn't involve scanning the buffer twice (once to find an actual potential match and once to discard things).

JohnSpeeks asked Mar 27 '17


1 Answer

The 3rd-party regex module (not re) offers partial match support, which is a partial solution. (Lookbehinds, the ^ anchor, zero-width matches, and the \b/\B anchors all break in subtle or not-so-subtle ways when you try to discard the start of the window and continue the search. Given how many edge cases I've found so far, I wouldn't be surprised if there are more.)

If you pass partial=True to regex.match, regex.search, regex.fullmatch, or regex.finditer, then in addition to reporting ordinary, complete matches, it will also report things that don't match yet, but could match if the string were extended:

In [8]: import regex

In [9]: regex.search(r'1234', '', partial=True)
Out[9]: <regex.Match object; span=(0, 0), match='', partial=True>

In [10]: regex.search(r'1234', '12', partial=True)
Out[10]: <regex.Match object; span=(0, 2), match='12', partial=True>

In [11]: regex.search(r'1234', '12 123', partial=True)
Out[11]: <regex.Match object; span=(3, 6), match='123', partial=True>

In [12]: regex.search(r'1234', '1234 123', partial=True)
Out[12]: <regex.Match object; span=(0, 4), match='1234'>

You can determine whether a match was partial or complete with the match object's partial attribute:

In [13]: regex.search(r'1234', '12 123', partial=True).partial
Out[13]: True

In [14]: regex.search(r'1234', '1234 123', partial=True).partial
Out[14]: False

It'll report a match as partial if more data might change the match result:

In [21]: regex.search(r'.*', 'asdf', partial=True)
Out[21]: <regex.Match object; span=(0, 4), match='asdf', partial=True>

In [22]: regex.search(r'ham(?: and eggs)?', 'ham', partial=True)
Out[22]: <regex.Match object; span=(0, 3), match='ham', partial=True>

or if more data might cause a match to not be a match:

In [23]: regex.search(r'1(?!234)', '1', partial=True)
Out[23]: <regex.Match object; span=(0, 1), match='1', partial=True>

In [24]: regex.search(r'1(?!234)', '13', partial=True)
Out[24]: <regex.Match object; span=(0, 1), match='1'>

When you reach the end of the data stream, you should turn off partial to let regex know that this is the end, so partial matches don't hide complete matches.

With partial match information, you can discard everything up to the start of a partial match and know that none of the discarded data would have been in a match... but lookbehinds might need that data, so it'd take messy extra work to support lookbehinds if you're doing this. ^ will also get confused by the start of the string changing, \b/\B won't know whether there was a word character at the end of the discarded data, and it'd be tricky to get zero-width match behavior correct, for whatever definition of "correct" you choose. I suspect some of the other advanced features regex offers might also interact strangely if you discard data this way; regex has a lot of features.
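Putting that together, here's a sketch of an advance that uses partial matches to prune the buffer (the class, the feed method, and the eof flag are my own framing, not part of the regex API, and the lookbehind/^/\b caveats above still apply):

```python
import regex  # third-party: pip install regex

class PruningMatcher:
    """Like the question's matcher, but discards bytes that can no longer
    be part of any match, using regex's partial match support."""

    def __init__(self, pattern):
        self.r = regex.compile(pattern)
        self.buffer = bytearray()

    def feed(self, data):
        self.buffer += data

    def advance(self, eof=False):
        # At end of stream, turn partial off so partial matches stop
        # hiding complete ones.
        m = self.r.search(self.buffer, partial=not eof)

        if m is None:
            # Nothing in the buffer can start a match: discard it all.
            self.buffer.clear()
            return None

        if m.partial:
            # Future data might complete this match; keep only the bytes
            # from its start onward and wait for more.
            del self.buffer[:m.start()]
            return None

        # Complete match: consume through its end and return the groups.
        self.buffer = self.buffer[m.end():]
        return m.groups()

sm = PruningMatcher(rb'Q([0-9][0-9][0-9])')
sm.feed(b'xxxQ1')
assert sm.advance() is None
assert bytes(sm.buffer) == b'Q1'   # the unmatchable b'xxx' was discarded
sm.feed(b'23yy')
assert sm.advance() == (b'123',)
```

Each call now only rescans from the start of the most recent partial match, not from the start of everything received so far.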

user2357112 supports Monica answered Sep 21 '22