Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Building a Regex Composer

I was reading the Java project idea described here:

The user gives examples of what he wants and does not want to match. The program tries to deduce a regex that fits the examples. Then it generates examples that would fit and not fit. The user corrects the examples it got wrong, and it composes a new regex. Iteratively, you get a regex that is close enough to what you need.

This sounds like a really interesting idea to me. Does anyone has an idea as to how to do this? My first idea was something like a genetic algorithm, but I would love some input from you guys.

like image 979
Amir Rachum Avatar asked Sep 06 '10 13:09

Amir Rachum


People also ask

How do you create expressions in regex?

If you want to match for the actual '+', '. ' etc characters, add a backslash( \ ) before that character. This will tell the computer to treat the following character as a search character and consider it for matching pattern. Example : \d+[\+-x\*]\d+ will match patterns like "2+2" and "3*9" in "(2+2) * 3*9".

What is regex generator?

A regular expression (regex, regexp or rational expression) is a sequence of characters that specifies a search pattern. Usually these patterns are used by string-searching algorithms for find or find and replace operations on strings, or for input validation.

Does vim use regex?

Using Regex in VimMany of Vim's search features allow you to use regular expressions. Some of them are: The search commands ( / and ? ) The global and substitute command-line (ex) commands ( :g and :s )


2 Answers

Actually, this starts to look more and more like a compiler application. In fact, if I remember correctly, the Aho Dragon compiler book uses a regex example to build a DFA compiler. That's the place to start. This could be a really cool compiler project.

If that is too much, you can approach it as an optimization in several passes to refine it further and further, but it will be all predefined algo's at first:

First pass: Want to match Cat, Catches cans Result: /Cat|Catches|Cans/

Second Pass: Look for similar starting conditions: Result: /Ca(t|tches|ans)/

Second Pass: Look for similar ending conditions: Result: /Ca(t|tch|an)s*/

Third Pass: Look for more refinements like repetitions and negative conditions

like image 84
SDGator Avatar answered Oct 12 '22 21:10

SDGator


There exists algorithm that does exactly this for positive examples.

Regular expression are equivalent to DFA (Deterministic Finite Automata). The strategie is mostly always the same:

Look at Alergia (for the theory) and MDI algorithm (for real usage) if generate an Deterministic Automaton is enough.

The Alergia algorithm and MDI are both described here: http://www.info.ucl.ac.be/~pdupont/pdupont/pdf/icml2k.pdf

If you want to generate smaller models you can use another algorithm. The article describing it is here: http://www.grappa.univ-lille3.fr/~lemay/publi/TCS02.ps.gz

His homepage is here: http://www.grappa.univ-lille3.fr/~lemay

If you want to use negative example, I suggest you to use a simple rule (coloring) that prevent two states of the DFA to be merged.

If you ask these people, I am sure they will share their code source.

I made the same kind of algorithm during my Ph.D. for probabilistic automata. That means, you can associate a probability to each string, and I have made a C++ program that "learn" Weighted Automata.

Mainly these algorithm work like that:

from positive examples: {abb, aba, abbb}

create the simplest DFA that accept exactly all these examples:

-> x -- a --> (y) -- b --> (z)
          \
            b --> t -- b --> (v)

x cant got to state y by reading the letter 'a' for example. The states are x, y, z, t and v. (z) means it is a finite state.

then "merge" states of the DFA: (here for example the result after merging states y and t.

               _
              /  \
             v    | a,b     ( <- this is a loop :-) )
x -- a -> (y,t) _/

the new state (y,t) is a terminal state obtaining by merging y and t. And you can read the letter a and b from it. Now the DFA can accept: a(a|b)* and it is easy to construct the regular expression from the DFA.

Which states to merge is a choice that makes the main difference between algorithms.

like image 4
yogsototh Avatar answered Oct 12 '22 22:10

yogsototh