Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Techniques for extracting regular expressions out of a labeled data set

Let's suppose I have a data set of several hundred thousand strings (which happen to be natural language sentences, if it matters) which are each tagged with a certain "label". Each sentence is tagged with exactly one label, and there are about 10 labels, each with approximately 10% of the data set belonging to them. There is a high degree of similarity to the structure of sentences within a label.

I know the above sounds like a classical example of a machine learning problem, but I want to ask a slightly different question. Are there any known techniques for programatically generating a set of regular expressions for each label, which can successfully classify the training data while still generalizing to future test data?

I would be very happy with references to the literature; I realize that this will not be a straightforward algorithm :)

PS: I know that the normal way to do classification is with machine learning techniques like an SVM or such. I am, however, explicitly looking for a way to generate regular expressions. (I would be happy with with machine learning techniques for generating the regular expressions, just not with machine learning techniques for doing the classification itself!)

like image 919
Adrian Petrescu Avatar asked May 24 '12 15:05

Adrian Petrescu


People also ask

What is extract regex?

Extracts the first matching substrings according to a regular expression.

What is regular expression in data visualization?

In simple, a regular expression is a sequence of characters that defines the search pattern which is used for the purpose of string matching, searching substring, finding the pattern, and find and replace operation, etc.

Which string methods are used with regular expression?

Regular expressions are used with the RegExp methods test() and exec() and with the String methods match() , replace() , search() , and split() .


2 Answers

This problem is usually framed as how to generate finite automata from sets of strings, rather than regular expressions, though you can obviously generate REs from FAs since they are equivalent.

If you search around for automata induction, you should be able to find quite a lot of literature on this topic, including GA approaches.

like image 57
Fred Foo Avatar answered Sep 28 '22 18:09

Fred Foo


So far as I know, this is the subject of current research in evolutionary computation.

Here are some examples:

See slides 40-44 at

https://cs.byu.edu/sites/default/files/Ken_De_Jong_slides.pdf (slides exist as of the posting of this answer).

Also, see

http://www.citeulike.org/user/bartolialberto/article/10710768

for a more detailed review of a system presented at GECCO 2012.

like image 41
SyAbleton Avatar answered Sep 28 '22 18:09

SyAbleton