How to efficiently match an input string against several regular expressions at once?

Tags: rest, regex, pattern-matching

How would one efficiently match one input string against any number of regular expressions?

One scenario where this might be useful is with REST web services. Let's assume that I have come up with a number of URL patterns for a REST web service's public interface:

/user/with-id/{userId}
/user/with-id/{userId}/profile
/user/with-id/{userId}/preferences
/users
/users/who-signed-up-on/{date}
/users/who-signed-up-between/{fromDate}/and/{toDate}
…

where {…} are named placeholders (like regular expression capturing groups).

(Note: This question is not about whether the above REST interface is well-designed or not. It probably isn't, but that shouldn't matter in the context of this question.)

It may be assumed that placeholders usually do not appear at the very beginning of a pattern (but they could). It can also be safely assumed that it is impossible for any string to match more than one pattern.

Now the web service receives a request. Of course, one could sequentially match the requested URI against one URL pattern, then against the next one, and so on; but that probably won't scale well for a larger number of patterns that must be checked.
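To make that baseline concrete, here is a minimal sketch of the sequential approach in Python. The helper names (`compile_template`, `match_sequentially`) and the choice of `[^/]+` as the regex for every placeholder are assumptions made for illustration only.

```python
import re

TEMPLATES = [
    "/user/with-id/{userId}",
    "/user/with-id/{userId}/profile",
    "/user/with-id/{userId}/preferences",
    "/users",
    "/users/who-signed-up-on/{date}",
    "/users/who-signed-up-between/{fromDate}/and/{toDate}",
]

def compile_template(template):
    # re.split with a capturing group alternates: literal, name, literal, ...
    parts = re.split(r"\{(\w+)\}", template)
    regex = "".join(
        re.escape(part) if i % 2 == 0 else f"(?P<{part}>[^/]+)"
        for i, part in enumerate(parts)
    )
    return re.compile(regex)

COMPILED = [(t, compile_template(t)) for t in TEMPLATES]

def match_sequentially(path):
    # Try every pattern in turn: cost grows with the number of patterns.
    for template, regex in COMPILED:
        m = regex.fullmatch(path)
        if m:
            return template, m.groupdict()
    return None
```

Every request that matches a late pattern (or none) pays for a full pass over the list, which is the scaling concern raised above.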

Are there any efficient algorithms for this?

Inputs:

An input string
A set of "mutually exclusive" regular expressions (i.e. no input string may match more than one expression)

Output:

The regular expression (if any) that the input string matches.

444

asked Aug 13 '11 10:08

stakx - no longer contributing

2 Answers

The Aho-Corasick algorithm is a very fast way to match an input string against a set of patterns (strictly speaking, keywords). The keywords are preprocessed and organized in a trie to speed up matching.
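As a rough illustration, here is a compact sketch of the plain keyword form of Aho-Corasick in Python (not a regex-capable variant). The function names `build_automaton` and `find_all` and the list-based node layout are my own choices for this example.

```python
from collections import deque

def build_automaton(keywords):
    # Node i is described by goto[i] (char -> child), fail[i], and out[i].
    goto, fail, out = [{}], [0], [[]]
    for kw in keywords:
        node = 0
        for ch in kw:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(kw)
    # Breadth-first pass to compute failure links (depth-1 nodes fail to root).
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit keywords that end at the failure target.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out

def find_all(text, automaton):
    goto, fail, out = automaton
    node, hits = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for kw in out[node]:
            hits.append((i - len(kw) + 1, kw))  # (start index, keyword)
    return hits
```

For example, `find_all("ushers", build_automaton(["he", "she", "his", "hers"]))` reports `she` starting at index 1 and both `he` and `hers` starting at index 2, in a single pass over the text.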

There are variations of the algorithm that support regex patterns (e.g. http://code.google.com/p/esmre/, just to name one) that are probably worth a look.

Or, you could split the URL patterns into chunks (path segments), organize them in a tree, then split the URL to match and walk the tree one chunk at a time. The {userId} can be treated as a wildcard, or required to match some specific format (e.g. be an int).

When you reach a leaf, you know which URL pattern you matched.
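The tree walk described above could be sketched like this in Python. The class and function names (`Node`, `add_pattern`, `match`) are hypothetical. The sketch assumes that a placeholder always fills a whole path segment, that patterns sharing a position use the same placeholder name, and that literal segments take priority over the wildcard without backtracking (which relies on the patterns being mutually exclusive). Because one pattern can be a prefix of another, a node records the pattern that ends there rather than relying on leaves only.

```python
class Node:
    def __init__(self):
        self.literal = {}      # exact segment text -> child Node
        self.wildcard = None   # (placeholder name, child Node) or None
        self.pattern = None    # pattern that ends at this node, if any

def add_pattern(root, pattern):
    node = root
    for seg in pattern.strip("/").split("/"):
        if seg.startswith("{") and seg.endswith("}"):
            if node.wildcard is None:
                node.wildcard = (seg[1:-1], Node())
            node = node.wildcard[1]
        else:
            node = node.literal.setdefault(seg, Node())
    node.pattern = pattern

def match(root, path):
    node, params = root, {}
    for seg in path.strip("/").split("/"):
        if seg in node.literal:            # one dict lookup per segment
            node = node.literal[seg]
        elif node.wildcard is not None:    # otherwise bind the placeholder
            name, node = node.wildcard
            params[name] = seg
        else:
            return None
    return (node.pattern, params) if node.pattern else None

root = Node()
for p in [
    "/user/with-id/{userId}",
    "/user/with-id/{userId}/profile",
    "/user/with-id/{userId}/preferences",
    "/users",
    "/users/who-signed-up-on/{date}",
    "/users/who-signed-up-between/{fromDate}/and/{toDate}",
]:
    add_pattern(root, p)
```

With this layout the cost of a lookup depends on the number of segments in the requested URL, not on how many patterns are registered. A format check (such as "must be an int") could be added where the placeholder is bound.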

168

answered Oct 10 '22 23:10

Savino Sguera

The standard solution for matching multiple regular expressions against an input stream is a lexer generator such as Flex (there are lots of these available, typically several for each programming language).

These tools take a set of regular expressions associated with "tokens" (think of tokens as just names for whatever a regular expression matches) and generate efficient finite-state automata that match all the regexes at the same time. Matching takes time linear in the size of the input stream, with a very small constant; it is hard to ask for "faster" than this. You feed it a character stream, and it emits the token name of the regex that matches "best" (this handles the case where two regexes can match the same string; see the documentation of the lexer generator for its definition of "best"), and advances the stream past what was recognized. So you can apply it again and again to match the input stream against a series of tokens.
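To give a feel for the "many named regexes in, one token name out" interface, here is a small Python sketch that joins the patterns into one alternation of named groups and reads the winner from `lastgroup`. This is only an approximation: Python's `re` module is a backtracking engine that tries the alternatives in order, so it does not give the linear-time behaviour of a generated automaton. The route names and the use of `[^/]+` for placeholders are assumptions for the example.

```python
import re

ROUTES = [
    ("user", r"/user/with-id/[^/]+"),
    ("user_profile", r"/user/with-id/[^/]+/profile"),
    ("user_preferences", r"/user/with-id/[^/]+/preferences"),
    ("users", r"/users"),
    ("users_signed_up_on", r"/users/who-signed-up-on/[^/]+"),
]

# One combined pattern: (?P<user>...)|(?P<user_profile>...)|...
COMBINED = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in ROUTES))

def which_route(path):
    m = COMBINED.fullmatch(path)
    # lastgroup is the name of the alternative that produced the match.
    return m.lastgroup if m else None
```

A real lexer generator would compile the same set of expressions into a single deterministic automaton ahead of time instead of trying alternatives one after another.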

Different lexer generators will allow you to capture different bits of the recognized stream in different ways, so you can, after recognizing a token, pick out the part you care about (e.g., for a literal string in quotes, you only care about the string content, not the quotes).

answered Oct 10 '22 23:10

Ira Baxter

