Building a Regex Composer

Tags:

algorithm

I was reading the Java project idea described here:

The user gives examples of what he wants and does not want to match. The program tries to deduce a regex that fits the examples. Then it generates examples that would fit and not fit. The user corrects the examples it got wrong, and it composes a new regex. Iteratively, you get a regex that is close enough to what you need.

This sounds like a really interesting idea to me. Does anyone has an idea as to how to do this? My first idea was something like a genetic algorithm, but I would love some input from you guys.

979

asked Sep 06 '10 13:09

2 Answers

Actually, this starts to look more and more like a compiler application. In fact, if I remember correctly, the Aho Dragon compiler book uses a regex example to build a DFA compiler. That's the place to start. This could be a really cool compiler project.

If that is too much, you can approach it as an optimization in several passes to refine it further and further, but it will be all predefined algo's at first:

First pass: Want to match Cat, Catches cans Result: /Cat|Catches|Cans/

Second Pass: Look for similar starting conditions: Result: /Ca(t|tches|ans)/

Second Pass: Look for similar ending conditions: Result: /Ca(t|tch|an)s*/

Third Pass: Look for more refinements like repetitions and negative conditions

answered Oct 12 '22 21:10

SDGator

There exists algorithm that does exactly this for positive examples.

Regular expression are equivalent to DFA (Deterministic Finite Automata). The strategie is mostly always the same:

Look at Alergia (for the theory) and MDI algorithm (for real usage) if generate an Deterministic Automaton is enough.

The Alergia algorithm and MDI are both described here: http://www.info.ucl.ac.be/~pdupont/pdupont/pdf/icml2k.pdf

If you want to generate smaller models you can use another algorithm. The article describing it is here: http://www.grappa.univ-lille3.fr/~lemay/publi/TCS02.ps.gz

His homepage is here: http://www.grappa.univ-lille3.fr/~lemay

If you want to use negative example, I suggest you to use a simple rule (coloring) that prevent two states of the DFA to be merged.

If you ask these people, I am sure they will share their code source.

I made the same kind of algorithm during my Ph.D. for probabilistic automata. That means, you can associate a probability to each string, and I have made a C++ program that "learn" Weighted Automata.

Mainly these algorithm work like that:

from positive examples: {abb, aba, abbb}

create the simplest DFA that accept exactly all these examples:

-> x -- a --> (y) -- b --> (z)
          \
            b --> t -- b --> (v)

x cant got to state y by reading the letter 'a' for example. The states are x, y, z, t and v. (z) means it is a finite state.

then "merge" states of the DFA: (here for example the result after merging states y and t.

               _
              /  \
             v    | a,b     ( <- this is a loop :-) )
x -- a -> (y,t) _/

the new state (y,t) is a terminal state obtaining by merging y and t. And you can read the letter a and b from it. Now the DFA can accept: a(a|b)* and it is easy to construct the regular expression from the DFA.

Which states to merge is a choice that makes the main difference between algorithms.

answered Oct 12 '22 22:10

yogsototh

Related questions
                            
                                What is a good hash function for a collection (i.e., multi-set) of integers?
                            
                                draw outside of UIView's bounds from drawRect
                            
                                Vim: How to set up an efficient autocomplete configuration?
                            
                                Getting started with Django on Amazon Web Services [closed]
                            
                                Remote Service Vs. Local Service
                            
                                CALayerInvalidGeometry', reason: 'CALayer bounds contains NaN: [0 0; nan nan] crash in view
                            
                                How can I use Multiple GLSurfaceView components in the same Layout?
                            
                                How to influence search path of System.loadLibrary() through Java code?
                            
                                How to attach a maven plugin to a phase by default?
                            
                                JPA Entities and/vs DTOs
                            
                                Are there any useful tools for diagnosing Qt layout and spacing problems?
                            
                                Parse double precision IEEE floating-point on a C compiler with no double precision type

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Building a Regex Composer

Tags:

regex

algorithm

Amir Rachum

People also ask

2 Answers

SDGator

yogsototh

Recent Activity

Donate For Us