Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regular expression generator/reducer?

I was posed an interesting question from a colleague for an operational pain point we currently have, and am curious if there's anything out there (utility/library/algorithm) that might help automate this.

Say you have a list of literal values (in our cases, they are URLs). What we want to do is, based on this list, come up with a single regex that matches all of those literal items.

So, if my list is:

http://www.example.com
http://www.example.com/subdir
http://foo.example.com

The simplest answer is

^(http://www.example.com|http://www.example.com/subdir|http://foo.example.com)$

but this gets large for lots of data, and we have a length limit we're trying to stay under.

Currently we manually write the regexes but this doesn't scale very well nor is it a great use of anyone's time. Is there a more automated way of decomposing the source data to come up with a length-optimal regex that matches all of the source values?

like image 870
Joe Avatar asked Jul 07 '10 15:07

Joe


People also ask

What is regex generator?

A regular expression (regex, regexp or rational expression) is a sequence of characters that specifies a search pattern. Usually these patterns are used by string-searching algorithms for find or find and replace operations on strings, or for input validation.

How do you create expressions in regex?

If you want to match for the actual '+', '. ' etc characters, add a backslash( \ ) before that character. This will tell the computer to treat the following character as a search character and consider it for matching pattern. Example : \d+[\+-x\*]\d+ will match patterns like "2+2" and "3*9" in "(2+2) * 3*9".

What does * do in regex?

This operator is similar to the match-zero-or-more operator except that it repeats the preceding regular expression at least once; see section The Match-zero-or-more Operator ( * ), for what it operates on, how some syntax bits affect it, and how Regex backtracks to match it.


1 Answers

The Aho-Corasick matching algorithm constructs a finite automaton to match multiple strings. You could convert the automaton to its equivalent regex but it is simpler to use the automaton directly (this is what the algorithm does.)

like image 195
Rafał Dowgird Avatar answered Oct 17 '22 15:10

Rafał Dowgird