Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there a data structure that provides lookup by pattern (regex)?

I've run into this situation several times: there are multiple patterns some text may match against, and you want to do something specific based on which pattern it is.

In the past I've always just used a list of regular expressions and iterated until finding a match.

What I'm wondering is if there is a more efficient data structure for this. Something like, if I were using C# for example, a Dictionary with Regex keys.

I realize that if the patterns are all prefixes or suffixes, then something like a Trie would make sense. It's not clear to me that this would work for the general case, though.

It also seems to me there could be some ambiguity here around key collisions; e.g., if some text matches more than one pattern, what should be returned? (I would think that maybe a non-deterministic result would be OK in that case; but as long as the behavior were documented I'd be fine with it.)

Anyway, does such a data structure exist, either in .NET or elsewhere?

like image 475
Dan Tao Avatar asked Aug 07 '13 19:08

Dan Tao


People also ask

Which data structure is best for lookup?

What is the most efficient data structure for this? Looking at complexity analysis, Hashtables seem to be the most efficient for lookups.

What is pattern based RegEx?

A regex pattern matches a target string. The pattern is composed of a sequence of atoms. An atom is a single point within the regex pattern which it tries to match to the target string. The simplest atom is a literal, but grouping parts of the pattern to match an atom will require using ( ) as metacharacters.

What data type is RegEx?

The Regexp data type The data type of regular expressions is Regexp . By default, Regexp matches any regular expression value. If you are looking for a type that matches strings which match arbitrary regular expressions, see the Pattern type. You can use parameters to restrict which values Regexp will match.

How do I create a pattern in RegEx?

The Escape Symbol : \' etc characters, add a backslash( \ ) before that character. This will tell the computer to treat the following character as a search character and consider it for matching pattern. Example : \d+[\+-x\*]\d+ will match patterns like "2+2" and "3*9" in "(2+2) * 3*9".


2 Answers

The fgrep tool does exactly what you're talking about: matches text against multiple regular expressions. My understanding is that the original version used something very similar to the Aho-Corasick string matching algorithm to search multiple regular expressions in a single pass. Basically, it created a DFA and ran through it.

I do not know of a .NET implementation of fgrep. If you find one, I'd certainly be interested to hear about it.

You might track down the fgrep source code (Google for it, there are lots of sources) and see how it's implemented.

Alternatively, you could have your program shell out to fgrep. Or perhaps create a C++ DLL that has an fgrep entry point that you could call from your C# program.

If your multiple patterns are constant strings (i.e. not regular expressions), then you might be interested in my C# implementation of the Aho-Corasick algorithm.

like image 125
Jim Mischel Avatar answered Sep 27 '22 16:09

Jim Mischel


Let's assume that these regular expressions are truly regular. Each can then be converted into a Nondeterministic Finite Automaton, which can be converted into a Deterministic Finite Automaton, which can be evaluated in O(n) time in the length of the input.

But it doesn't address the question of matching multiple regexps at the same time. We can do this by creating a single regexp that looks like this: (regexp1|regexp2|...), and turn that into a single NFA/DFA. Add some instrumentation to the branches of the automaton to keep track of which particular regexp produced the path that matched the input, and you've got your matcher, still O(n) in the length of the input string.

This technique would not support any "regex" features that make the language non-regular, such as backreferences.

Another drawback is that the resulting DFA could be large. It is also possible to evaluate the NFA directly, which is probably slower but has better memory behaviour.


Actually it's pretty easy to express this idea in code as well, without worrying about the automaton stuff. Just use matching groups:

combined_regexp = (regexp1)|(regexp2)|...

At evaluation time, just see which group matched the input.

Keep in mind that most regex implementations/libraries have pretty poor behaviour in some corner cases, where they can take exponential time to compile or match the regexp. I'm not sure how much of a problem that is in practice. Google's RE2 library is one that was specifically designed not to have such pathological behaviour, but there might be others.

Another problem could be that, unless your regexp implementation specifically advertises O(n) behaviour, it might just try each of the alternatives in turn. In that case this approach wouldn't buy you anything.

like image 34
Thomas Avatar answered Sep 27 '22 18:09

Thomas