Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient (basic) regular expression implementation for streaming data

I'm looking for an implementation of regular expression matching that operates on a stream of data -- i.e., it has an API that allows a user to pass in one character at a time and report when a match is found on the stream of characters seen so far. Only very basic (classic) regular expressions are needed, so a DFA/NFA based implementation seems like it would be well-suited to the problem.

Based on the fact that it's possible to do regular expression matching using a DFA/NFA in a single linear sweep, it seems like a streaming implementation should be possible.

Requirements:

  • The library should not try to wait until the full string has been read before performing the match. The data I have really is streaming; there is no way to know how much data will arrive, it's not possible to seek forward or backward.

  • Implementing specific stream matching for a couple special cases is not an option, as I don't know in advance what patterns a user might want to look for.

  • Language: usable from C/C++

For the curious, my use case is the following: I have a system which intercepts memory writes inside a full system emulator, and I would like to have a way to identify memory writes that match a regular expression (e.g., one could use this to find the point in the system where a URL is written to memory).

I have found:

Apply a Regex on Stream?

Applying a regular expression to a Java I/O Stream

Code Guru - Building a Regular Expression Stream Search with the .NET Framework

But all of these attempt to convert the stream to a string first and then use a stock regular expression library.

Another thought I had was to modify the RE2 library, but according to the author it is architected around the assumption that the entire string is in memory at the same time.

If nothing's available, then I can start down the unhappy path of reinventing this wheel to fit my own needs, but I'd really rather not if I can avoid it. Any help would be greatly appreciated!

like image 391
Brendan Dolan-Gavitt Avatar asked Oct 12 '12 03:10

Brendan Dolan-Gavitt


2 Answers

I have exactly what you're looking for: https://github.com/agentzh/sregex http://agentzh.org/misc/slides/yapc-na-2013-sregex.pdf‎

And if you know javascript (or want a javascript version), I have an exercise that lets you easily do that using an existing RE implementation: https://github.com/dhruvbird/regexp-js

like image 200
dhruvbird Avatar answered Nov 05 '22 11:11

dhruvbird


Intel has released hyperscan library that does exactly what is needed. Here it is a brief overview of it.

like image 21
Dmitry Marchuk Avatar answered Nov 05 '22 12:11

Dmitry Marchuk