Is there a data structure that provides lookup by pattern (regex)?

I've run into this situation several times: there are multiple patterns some text may match against, and you want to do something specific based on which pattern it is.

In the past I've always just used a list of regular expressions and iterated until finding a match.
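For concreteness, the list-based approach looks roughly like this. It is only a sketch in Python; the patterns, the labels, and the `classify` name are made up for illustration.

```python
import re

# Hypothetical (pattern, label) pairs, for illustration only.
RULES = [
    (re.compile(r"\d+"), "integer"),
    (re.compile(r"[A-Za-z_]\w*"), "identifier"),
    (re.compile(r"\s+"), "whitespace"),
]

def classify(text):
    """Return the label of the first rule that matches all of text, or None."""
    for pattern, label in RULES:
        if pattern.fullmatch(text):
            return label
    return None
```

The cost grows with the number of patterns, since each one is tried in turn until one matches.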

What I'm wondering is if there is a more efficient data structure for this. Something like, if I were using C# for example, a Dictionary with Regex keys.

I realize that if the patterns are all prefixes or suffixes, then something like a Trie would make sense. It's not clear to me that this would work for the general case, though.

It also seems to me there could be some ambiguity here around key collisions; e.g., if some text matches more than one pattern, what should be returned? (I would think that even a non-deterministic result would be OK in that case; as long as the behavior were documented, I'd be fine with it.)

Anyway, does such a data structure exist, either in .NET or elsewhere?

475

asked Aug 07 '13 19:08

Dan Tao

2 Answers

The fgrep tool does something close to what you're talking about: it matches text against multiple patterns in a single pass, although its patterns are fixed strings rather than regular expressions. My understanding is that the original version used something very similar to the Aho-Corasick string matching algorithm to search for all of the strings at once. Basically, it created a DFA and ran through it.

I do not know of a .NET implementation of fgrep. If you find one, I'd certainly be interested to hear about it.

You might track down the fgrep source code (Google for it, there are lots of sources) and see how it's implemented.

Alternatively, you could have your program shell out to fgrep. Or perhaps create a C++ DLL that has an fgrep entry point that you could call from your C# program.

If your multiple patterns are constant strings (i.e. not regular expressions), then you might be interested in my C# implementation of the Aho-Corasick algorithm.
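To give a sense of what the Aho-Corasick algorithm involves, here is a minimal Python sketch. It is not the C# implementation mentioned above; the function names `build_automaton` and `find_all` are made up for illustration, and it assumes non-empty keywords.

```python
from collections import deque

def build_automaton(keywords):
    """Build Aho-Corasick goto/fail/output tables for a list of fixed strings."""
    goto = [{}]   # goto[state][char] -> next state (a trie of the keywords)
    fail = [0]    # fail[state] -> state to fall back to on a mismatch
    out = [[]]    # out[state] -> keywords that end at this state
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first pass: depth-1 states fail to the root; deeper states
    # follow their parent's failure chain until the same character fits.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit keywords that end at the fallback state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_all(text, automaton):
    """Return (start_index, keyword) for every occurrence, in one pass over text."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((i - len(word) + 1, word))
    return hits
```

For example, `find_all("ushers", build_automaton(["he", "she", "his", "hers"]))` reports "she" at index 1 and both "he" and "hers" at index 2, without rescanning the text once per keyword.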

125

answered Sep 27 '22 16:09

Jim Mischel

Let's assume that these regular expressions are truly regular. Each can then be converted into a Nondeterministic Finite Automaton, which can be converted into a Deterministic Finite Automaton, which can be evaluated in O(n) time in the length of the input.

But that doesn't yet address the question of matching multiple regexps at the same time. We can do this by creating a single regexp that looks like this: (regexp1|regexp2|...), and turning that into a single NFA/DFA. Add some instrumentation to the branches of the automaton to keep track of which particular regexp produced the path that matched the input, and you've got your matcher, still O(n) in the length of the input string.

This technique would not support any "regex" features that make the language non-regular, such as backreferences.

Another drawback is that the resulting DFA could be large; in the worst case the number of DFA states is exponential in the number of NFA states. It is also possible to evaluate the NFA directly, which is probably slower but has better memory behaviour.

Actually it's pretty easy to express this idea in code as well, without worrying about the automaton stuff. Just use matching groups:

combined_regexp = (regexp1)|(regexp2)|...

At evaluation time, just see which group matched the input.
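A minimal sketch of the group-based lookup, written in Python for brevity. The pattern names, the expressions, and the `lookup` function are hypothetical and only illustrate the idea.

```python
import re

# Hypothetical pattern names and expressions, chosen only for illustration.
PATTERNS = {
    "integer": r"\d+",
    "identifier": r"[A-Za-z_]\w*",
    "whitespace": r"\s+",
}

# Wrap each alternative in its own named group:
# (?P<integer>\d+)|(?P<identifier>[A-Za-z_]\w*)|(?P<whitespace>\s+)
COMBINED = re.compile(
    "|".join(f"(?P<{name}>{expr})" for name, expr in PATTERNS.items())
)

def lookup(text):
    """Return the name of the alternative that matched all of text, or None."""
    match = COMBINED.fullmatch(text)
    # lastgroup is the name of the last group that matched, which here is
    # the named group wrapping the successful alternative.
    return match.lastgroup if match else None
```

Note that Python's `re` module is a backtracking engine, so this shows the lookup-by-group idea rather than a guaranteed single-pass match. When more than one alternative could match, the earliest one in the combined expression wins.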

Keep in mind that most regex implementations/libraries have pretty poor behaviour in some corner cases, where they can take exponential time to compile or match the regexp. I'm not sure how much of a problem that is in practice. Google's RE2 library is one that was specifically designed not to have such pathological behaviour, but there might be others.

Another problem could be that, unless your regexp implementation specifically advertises O(n) behaviour, it might just try each of the alternatives in turn. In that case this approach wouldn't buy you anything.

answered Sep 27 '22 18:09

Thomas

Tags: c#, .net, regex, data-structures
