
Database or structure suitable for matching strings to regex patterns

I have a number of regex patterns. When a string is input, I have to find all the patterns that match this string. This is normally an O(n) operation:

SELECT regex FROM regexes WHERE 'string' RLIKE regex

What is the fastest way to do this? Are there database structures or storage systems that are optimized to do such an operation?
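For reference, the query above amounts to a linear scan: every stored pattern is tested against the input. A minimal application-side sketch of that same O(n) scan in Python (the pattern list is purely illustrative; in the question the patterns live in the regexes table):

import re

# Illustrative patterns standing in for the contents of the regexes table.
patterns = [r"\d{3}-\d{2}-\d{4}", r"<[^>]+>", r"[\w.+-]+@[\w-]+\.\w+"]
compiled = [re.compile(p) for p in patterns]

def matching_patterns(s):
    # The O(n) scan RLIKE performs: test every pattern against the input.
    return [r.pattern for r in compiled if r.search(s)]

print(matching_patterns("reach me at foo@example.com"))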

gpilotino asked Mar 05 '09



2 Answers

The short answer is 'No.' There is no index structure currently available on any DBMS platform that will index partial matches of a regex like this.

The long answer is that a leading constant on a wildcard match (e.g. 'foo_') can be used as a prefix for index matches. Many DBMS platforms will optimise this and use an index (if available) to resolve the prefix. However, this is nowhere near as flexible as a full regex, and the index can only be used when there is a constant prefix.
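Carried over to the question's direction (one input string against many stored patterns), the same leading-constant idea can be approximated in the application: group patterns by any literal text they begin with, and skip a group entirely when that literal never appears in the input. A rough Python sketch with a deliberately naive prefix extractor (it ignores escapes, anchors and alternation, so treat it as an illustration only):

import re

METACHARS = set("\\^$.|?*+()[{")

def literal_prefix(pattern):
    # Collect leading characters up to the first regex metacharacter.
    out = []
    for ch in pattern:
        if ch in METACHARS:
            break
        out.append(ch)
    return "".join(out)

patterns = [r"ERROR: \d+", r"ERROR: timeout.*", r"WARN: .*disk.*"]
buckets = {}
for p in patterns:
    buckets.setdefault(literal_prefix(p), []).append(re.compile(p))

def candidate_matches(s):
    hits = []
    for prefix, regexes in buckets.items():
        # A pattern that starts with a literal can only match if that literal
        # occurs somewhere in the input, so whole buckets can be skipped cheaply.
        if prefix in s:
            hits.extend(r.pattern for r in regexes if r.search(s))
    return hits

print(candidate_matches("ERROR: 42 widgets failed"))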

The even longer answer is that there are algorithms such as RETE that will optimise partial matches like this. This might be applicable if you can express your matches as forward-chaining production rules rather than regular expressions.

Rete works by computing partial matches and only presenting rules that can be reached from this partial match, so it is more efficient than O(n) (more like O(log n) but I'm not sure of the exact time complexity) for matching n rules against a fact.
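This isn't Rete itself, but the underlying idea (evaluate shared conditions once, then fire only the rules whose conditions all hold) can be sketched in a few lines of Python; the condition and rule names below are invented for illustration:

import re

# Conditions are cheap tests shared by several rules.
conditions = {
    "has_digits": re.compile(r"\d").search,
    "has_at_sign": re.compile(r"@").search,
    "has_tag": re.compile(r"<[^>]+>").search,
}

# Each rule fires only when all of its conditions hold.
rules = {
    "phone_like": ["has_digits"],
    "email_like": ["has_at_sign"],
    "tagged_number": ["has_digits", "has_tag"],
}

def fired_rules(s):
    # Each shared condition is evaluated exactly once per input...
    facts = {name: bool(test(s)) for name, test in conditions.items()}
    # ...and only rules whose conditions are all satisfied are fired.
    return [rule for rule, needed in rules.items() if all(facts[c] for c in needed)]

print(fired_rules("mail me at bob@example.org"))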

ConcernedOfTunbridgeWells answered Oct 11 '22


One optimisation you could implement, if applicable to your case, would be to categorise your regexes and organise them in hierarchies so that:

  • you only have to test a handful of the most general regexes;

  • for any general regex that matches, you then test the string against only the regexes of that category.

For instance, if your input strings can be anything arbitrarily complex and you have thousands of regexes, you could organise them in categories like:

  • the \d+ category, which would test number patterns (SSN, telephone numbers etc)

  • the <.*?> category, which would test the presence of HTML tags

  • the \w+@\w+ category, which could test the presence of email addresses

etc.

If any root pattern doesn't match, then you avoid having to test whole ranges of patterns that would fail anyway.
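A minimal Python sketch of that two-level test, where the category roots and the member patterns are placeholders rather than anything from the question:

import re

# Cheap, general "root" patterns guarding groups of more specific ones.
categories = {
    r"\d+": [r"\d{3}-\d{2}-\d{4}", r"\(\d{3}\) \d{3}-\d{4}"],   # numbers: SSNs, phones
    r"<.*?>": [r"<a\s+href=", r"<img\b[^>]*>"],                  # HTML tags
    r"\w+@\w+": [r"[\w.+-]+@[\w-]+\.\w+"],                       # email addresses
}

compiled = {
    re.compile(root): [re.compile(p) for p in members]
    for root, members in categories.items()
}

def matching_patterns(s):
    hits = []
    for root, members in compiled.items():
        # Only descend into a category when its general root pattern matches.
        if root.search(s):
            hits.extend(m.pattern for m in members if m.search(s))
    return hits

print(matching_patterns("call (555) 123-4567 or write jane@example.com"))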

Don't know if that would match your exact domain, but it's a possible optimisation.

Renaud Bompuis answered Oct 11 '22