Is it possible for a computer to "learn" a regular expression by user-provided examples?

Tags:

Is it possible for a computer to "learn" a regular expression by user-provided examples?

To clarify:

I do not want to learn regular expressions.
I want to create a program which "learns" a regular expression from examples which are interactively provided by a user, perhaps by selecting parts from a text or selecting begin or end markers.

Is it possible? Are there algorithms, keywords, etc. which I can Google for?

EDIT: Thank you for the answers, but I'm not interested in tools which provide this feature. I'm looking for theoretical information, like papers, tutorials, source code, names of algorithms, so I can create something for myself.

962

asked Mar 05 '09 19:03

Daniel Rikowski

2 Answers

Yes, it is possible, we can generate regexes from examples (text -> desired extractions). This is a working online tool which does the job: http://regex.inginf.units.it/

Regex Generator++ online tool generates a regex from provided examples using a GP search algorithm. The GP algorithm is driven by a multiobjective fitness which leads to higher performance and simpler solution structure (Occam's Razor). This tool is a demostrative application by the Machine Lerning Lab, Trieste Univeristy (Università degli studi di Trieste). Please look at the video tutorial here.

This is a research project so you can read about used algorithms here.

Behold! :-)

Finding a meaningful regex/solution from examples is possible if and only if the provided examples describe the problem well. Consider these examples that describe an extraction task, we are looking for particular item codes; the examples are text/extraction pairs:

"The product code is 467-345A" -> "467-345A" "The item 789-345B is broken"  -> "789-345B"

An (human) guy, looking at the examples, may say: "the item codes are things like \d++-345[AB]"

When the item code is more permissive but we have not provided other examples, we have not proofs to understand the problem well. When applying the human generated solution \d++-345[AB] to the following text, it fails:

"On the back of the item there is a code: 966-347Z"

You have to provide other examples, in order to better describe what is a match and what is not a desired match: --i.e:

"My phone is +39-128-3905 , and the phone product id is 966-347Z" -> "966-347Z"

The phone number is not a product id, this may be an important proof.

168

answered Oct 04 '22 13:10

Fabiano Tarlao

The book An Introduction to Computational Learning Theory contains an algorithm for learning a finite automaton. As every regular language is equivalent to a finite automaton, it is possible to learn some regular expressions by a program. Kearns and Valiant show some cases where it is not possible to learn a finite automaton. A related problem is learning hidden Markov Models, which are probabilistic automata that can describe a character sequence. Note that most modern "regular expressions" used in programming languages are actually stronger than regular languages, and therefore sometimes harder to learn.

answered Oct 04 '22 13:10

Yuval F

Related questions
                            
                                escaping question mark in regex javascript
                            
                                RegExp.exec() returns NULL sporadically
                            
                                How do HTML parses work if they're not using regexp?
                            
                                Delimiter must not be alphanumeric or backslash and preg_match
                            
                                Regex to validate JSON
                            
                                JavaScript replace \n with <br /> [duplicate]
                            
                                Regex for converting CamelCase to camel_case in java
                            
                                Regular Expression Match to test for a valid year
                            
                                How to check that a string is a palindrome using regular expressions?
                            
                                Regular expression containing one word or another
                            
                                JavaScript - string regex backreferences
                            
                                Using sed and grep/egrep to search and replace
                            
                                grepping using the "|" alternative operator
                            
                                Which regular expression operator means 'Don't' match this character?
                            
                                Javascript regular expression: remove first and last slash
                            
                                Python regex - r prefix
                            
                                Get the index of a pattern in a string using regex
                            
                                Remove part of a string
                            
                                Is there a difference between /\s/g and /\s+/g?
                            
                                Validate email address in Dart? [duplicate]

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Is it possible for a computer to "learn" a regular expression by user-provided examples?

Tags:

regex

artificial-intelligence

theory

automata

Daniel Rikowski

People also ask

2 Answers

Fabiano Tarlao

Yuval F

Recent Activity

Donate For Us