I was reading the Java project idea described here:
The user gives examples of what he wants and does not want to match. The program tries to deduce a regex that fits the examples. Then it generates examples that would fit and not fit. The user corrects the examples it got wrong, and it composes a new regex. Iteratively, you get a regex that is close enough to what you need.
This sounds like a really interesting idea to me. Does anyone has an idea as to how to do this? My first idea was something like a genetic algorithm, but I would love some input from you guys.
If you want to match for the actual '+', '. ' etc characters, add a backslash( \ ) before that character. This will tell the computer to treat the following character as a search character and consider it for matching pattern. Example : \d+[\+-x\*]\d+ will match patterns like "2+2" and "3*9" in "(2+2) * 3*9".
A regular expression (regex, regexp or rational expression) is a sequence of characters that specifies a search pattern. Usually these patterns are used by string-searching algorithms for find or find and replace operations on strings, or for input validation.
Using Regex in VimMany of Vim's search features allow you to use regular expressions. Some of them are: The search commands ( / and ? ) The global and substitute command-line (ex) commands ( :g and :s )
Actually, this starts to look more and more like a compiler application. In fact, if I remember correctly, the Aho Dragon compiler book uses a regex example to build a DFA compiler. That's the place to start. This could be a really cool compiler project.
If that is too much, you can approach it as an optimization in several passes to refine it further and further, but it will be all predefined algo's at first:
First pass: Want to match Cat, Catches cans Result: /Cat|Catches|Cans/
Second Pass: Look for similar starting conditions: Result: /Ca(t|tches|ans)/
Second Pass: Look for similar ending conditions: Result: /Ca(t|tch|an)s*/
Third Pass: Look for more refinements like repetitions and negative conditions
There exists algorithm that does exactly this for positive examples.
Regular expression are equivalent to DFA (Deterministic Finite Automata). The strategie is mostly always the same:
Look at Alergia (for the theory) and MDI algorithm (for real usage) if generate an Deterministic Automaton is enough.
The Alergia algorithm and MDI are both described here: http://www.info.ucl.ac.be/~pdupont/pdupont/pdf/icml2k.pdf
If you want to generate smaller models you can use another algorithm. The article describing it is here: http://www.grappa.univ-lille3.fr/~lemay/publi/TCS02.ps.gz
His homepage is here: http://www.grappa.univ-lille3.fr/~lemay
If you want to use negative example, I suggest you to use a simple rule (coloring) that prevent two states of the DFA to be merged.
If you ask these people, I am sure they will share their code source.
I made the same kind of algorithm during my Ph.D. for probabilistic automata. That means, you can associate a probability to each string, and I have made a C++ program that "learn" Weighted Automata.
Mainly these algorithm work like that:
from positive examples: {abb, aba, abbb}
create the simplest DFA that accept exactly all these examples:
-> x -- a --> (y) -- b --> (z)
\
b --> t -- b --> (v)
x cant got to state y by reading the letter 'a' for example. The states are x, y, z, t and v. (z) means it is a finite state.
then "merge" states of the DFA: (here for example the result after merging states y and t.
_
/ \
v | a,b ( <- this is a loop :-) )
x -- a -> (y,t) _/
the new state (y,t) is a terminal state obtaining by merging y and t. And you can read the letter a and b from it. Now the DFA can accept: a(a|b)* and it is easy to construct the regular expression from the DFA.
Which states to merge is a choice that makes the main difference between algorithms.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With