
How to efficiently match an input string against several regular expressions at once?

How would one efficiently match one input string against any number of regular expressions?

One scenario where this might be useful is with REST web services. Let's assume that I have come up with a number of URL patterns for a REST web service's public interface:

  • /user/with-id/{userId}
  • /user/with-id/{userId}/profile
  • /user/with-id/{userId}/preferences
  • /users
  • /users/who-signed-up-on/{date}
  • /users/who-signed-up-between/{fromDate}/and/{toDate}

where {…} are named placeholders (like regular expression capturing groups).

Note: This question is not about whether the above REST interface is well-designed or not. (It probably isn't, but that shouldn't matter in the context of this question.)

It may be assumed that placeholders usually do not appear at the very beginning of a pattern (but they could). It can also be safely assumed that it is impossible for any string to match more than one pattern.

Now the web service receives a request. Of course, one could sequentially match the requested URI against one URL pattern, then against the next one, and so on; but that probably won't scale well for a larger number of patterns that must be checked.
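For reference, the sequential approach described above might look like the following minimal Python sketch. The helper names (compile_template, match_sequentially) and the choice that a placeholder matches one non-empty path segment ([^/]+) are assumptions made for illustration; they are not part of the question.

```python
import re

TEMPLATES = [
    "/user/with-id/{userId}",
    "/user/with-id/{userId}/profile",
    "/user/with-id/{userId}/preferences",
    "/users",
    "/users/who-signed-up-on/{date}",
    "/users/who-signed-up-between/{fromDate}/and/{toDate}",
]

def compile_template(template):
    """Turn a URL template into a compiled regex (hypothetical helper)."""
    # Splitting on a pattern with one capturing group yields alternating
    # literal text (even indexes) and placeholder names (odd indexes).
    parts = re.split(r"\{(\w+)\}", template)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)
        else:
            # Assumption: a placeholder matches one non-empty path segment.
            regex += "(?P<%s>[^/]+)" % part
    return re.compile(regex)

COMPILED = [(t, compile_template(t)) for t in TEMPLATES]

def match_sequentially(path):
    """Try each pattern in turn; cost grows with the number of patterns."""
    for template, regex in COMPILED:
        m = regex.fullmatch(path)
        if m:
            return template, m.groupdict()
    return None
```

Each lookup may run every compiled pattern once, which is the linear-in-the-number-of-patterns behaviour the question wants to avoid.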

Are there any efficient algorithms for this?

Inputs:

  • An input string
  • A set of "mutually exclusive" regular expressions (i.e. no input string may match more than one expression)

Output:

  • The regular expression (if any) that the input string matched against.

asked Aug 13 '11 10:08 by stakx - no longer contributing



2 Answers

The Aho-Corasick algorithm is a very fast algorithm for matching an input string against a set of patterns (strictly speaking, fixed keywords). The keywords are preprocessed and organized in a trie to speed up matching.
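To make the idea concrete, here is a compact Python sketch of the textbook Aho-Corasick construction. It handles fixed keywords only, not regular expressions, and the function names (build_automaton, find_all) are made up for this example.

```python
from collections import deque

def build_automaton(keywords):
    """Build the keyword trie plus failure links (textbook Aho-Corasick)."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix in the trie
    out = [[]]    # out[state] -> keywords that end at this state
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first pass: depth-1 states fail to the root; deeper states
    # follow their parent's failure chain.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit keywords that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_all(text, automaton):
    """Return (start_index, keyword) for every keyword occurrence in text."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((i - len(word) + 1, word))
    return hits
```

The text is scanned once regardless of how many keywords there are, which is what makes the approach attractive for large pattern sets.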

There are variations of the algorithm that support regex patterns (e.g. http://code.google.com/p/esmre/, to name just one) and are probably worth a look.

Or, you could split the URL patterns into chunks and organize them in a tree, then split the URL to be matched and walk the tree one chunk at a time. A placeholder such as {userId} can be treated as a wildcard, or be required to match some specific format (e.g. be an int).

When you reach a leaf, you know which URL pattern you matched.
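A rough Python sketch of that tree walk follows. It assumes patterns are split on "/", a placeholder occupies a whole segment, and a literal child is preferred over the wildcard child without backtracking. The class and function names are invented for this example.

```python
class Node:
    def __init__(self):
        self.literal = {}     # literal segment -> child Node
        self.wildcard = None  # child Node reached through a {placeholder} segment
        self.param = None     # placeholder name on the wildcard edge
        self.template = None  # set when a URL pattern ends at this node

def build_tree(templates):
    root = Node()
    for template in templates:
        node = root
        for seg in template.strip("/").split("/"):
            if seg.startswith("{") and seg.endswith("}"):
                if node.wildcard is None:
                    node.wildcard = Node()
                    # Assumption: patterns sharing a wildcard position also
                    # share the placeholder name; the first name seen wins.
                    node.param = seg[1:-1]
                node = node.wildcard
            else:
                node = node.literal.setdefault(seg, Node())
        node.template = template
    return root

def lookup(root, path):
    """Walk the tree one segment at a time; return (template, params) or None."""
    node, params = root, {}
    for seg in path.strip("/").split("/"):
        if seg in node.literal:
            node = node.literal[seg]
        elif node.wildcard is not None and seg:
            params[node.param] = seg
            node = node.wildcard
        else:
            return None
    return (node.template, params) if node.template else None

ROOT = build_tree([
    "/user/with-id/{userId}",
    "/user/with-id/{userId}/profile",
    "/user/with-id/{userId}/preferences",
    "/users",
    "/users/who-signed-up-on/{date}",
    "/users/who-signed-up-between/{fromDate}/and/{toDate}",
])
```

The lookup cost depends on the number of segments in the requested URL rather than on the number of patterns. Because the walk never backtracks, a pattern set in which a literal and a wildcard compete at the same position (for example /a/{x}/c alongside /a/b/d) would need extra handling.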

answered Oct 10 '22 23:10 by Savino Sguera


The standard solution for matching multiple regular expressions against an input stream is a lexer generator such as Flex (there are lots of these available, typically several for each programming language).

These tools take a set of regular expressions, each associated with a "token" (think of a token as just a name for whatever a regular expression matches), and generate an efficient finite-state automaton that matches all the regexes at the same time. Matching takes time linear in the size of the input stream, with a very small constant; it is hard to ask for "faster" than this. You feed it a character stream, and it emits the token name of the regex that matches "best" (this handles the case where two regexes can match the same string; see the lexer generator's documentation for its definition of "best"), and advances the stream past what was recognized. So you can apply it again and again to split the input stream into a series of tokens.
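The "one combined matcher that reports which pattern matched" interface can be imitated in Python by joining all patterns into a single alternation of named groups. This is only an approximation: Python's re module is a backtracking engine, so the sketch below does not give the linear-time guarantee of a generated automaton. The group-naming scheme (t0, t0_userId, ...) and the helper names are assumptions made for this example.

```python
import re

TEMPLATES = [
    "/user/with-id/{userId}",
    "/user/with-id/{userId}/profile",
    "/user/with-id/{userId}/preferences",
    "/users",
    "/users/who-signed-up-on/{date}",
    "/users/who-signed-up-between/{fromDate}/and/{toDate}",
]

def build_combined(templates):
    """Join all templates into one regex: (?:(?P<t0>...)|(?P<t1>...)|...)\\Z"""
    alternatives = []
    for i, template in enumerate(templates):
        parts = re.split(r"\{(\w+)\}", template)
        body = ""
        for j, part in enumerate(parts):
            if j % 2 == 0:
                body += re.escape(part)
            else:
                # Group names must be unique within one pattern, so each
                # placeholder is prefixed with its template index.
                body += "(?P<t%d_%s>[^/]+)" % (i, part)
        alternatives.append("(?P<t%d>%s)" % (i, body))
    return re.compile(r"(?:%s)\Z" % "|".join(alternatives))

COMBINED = build_combined(TEMPLATES)

def match_combined(path):
    """Return (template, params) for the alternative that matched, or None."""
    m = COMBINED.match(path)
    if m is None:
        return None
    # The outer group t<i> closes last, so lastgroup names the winning template.
    index = int(m.lastgroup[1:])
    prefix = "t%d_" % index
    params = {
        name[len(prefix):]: value
        for name, value in m.groupdict().items()
        if name.startswith(prefix) and value is not None
    }
    return TEMPLATES[index], params
```

A real lexer generator compiles the same kind of token list into a deterministic automaton ahead of time, which is where the linear-time behaviour described above comes from.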

Different lexer generators let you capture different parts of the recognized text in different ways, so after recognizing a token you can pick out the part you care about (e.g., for a literal string in quotes, you only care about the string content, not the quotes).

answered Oct 10 '22 23:10 by Ira Baxter