
Grammatical inference of regular expressions for given finite list of representative strings?

I'm working on analyzing a large public dataset with lots of verbose human-readable strings that were clearly generated by some regular (in the formal language theory sense) grammar.

It's not too hard to look at sets of these strings one by one to see the patterns; unfortunately, there are about 24,000 unique strings broken up into 33 categories and 1714 subcategories, so it's somewhat painful to do this manually.

Basically, I'm looking for an existing algorithm (preferably with an existing reference implementation) to take an arbitrary list of strings and try to infer some minimal (for some reasonable definition of minimal) spanning set of regular expressions that can be used to generate them (i.e. infer a regular grammar from a finite set of strings from the language generated by that grammar).

I've considered doing repeated greedy longest common substring elimination, but that only goes so far because it won't collapse anything but exact matches, so it won't detect, say, a common pattern of varying numerical strings at a particular position in the grammar.
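To make the limitation concrete, here is a minimal Python sketch of one way to get past exact matching: generalize every run of digits to `\d+` before grouping, so strings that differ only in their numbers fall into the same template. The function names and the sample strings are hypothetical and not taken from the dataset; this only handles numeric variation and is not a general grammar inference method.

```python
import re
from collections import defaultdict


def template_of(s: str) -> str:
    """Turn a string into a regex template: literal text is escaped,
    and every run of digits is generalized to \\d+."""
    # With one capturing group, re.split puts the digit runs at odd indices.
    parts = re.split(r"(\d+)", s)
    return "".join(
        r"\d+" if i % 2 == 1 else re.escape(part)
        for i, part in enumerate(parts)
    )


def group_by_template(strings):
    """Group strings that share the same digit-generalized template."""
    groups = defaultdict(list)
    for s in strings:
        groups[template_of(s)].append(s)
    return dict(groups)


# Hypothetical sample strings, for illustration only.
samples = ["Route 12 closed", "Route 7 closed", "Exit 301 open"]
groups = group_by_template(samples)
for template, members in groups.items():
    print(template, members)
```

Plain substring elimination would treat `12` and `7` as unrelated literals; here the two "Route" strings collapse into a single template, while anything beyond digit runs (optional words, alternations, repetition) is left uncollapsed.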

Brute forcing anything that doesn't fall out of common substring elimination is possible, but probably computationally infeasible. (Furthermore, I've thought about it and there might be a "phase ordering" and/or "local minimum" issue with substring elimination, since a greedy substring match that looks like the best reduction initially might end up forcing the final grammar to be less compressed/minimal.)

Stephen Lin asked Mar 20 '13 00:03



1 Answer

Yes, it turns out this does exist; what is needed is known academically as a DFA learning algorithm, examples of which include:

  • Angluin's L*
  • L* (adding counter-examples to columns)
  • Kearns / Vazirani
  • Rivest / Schapire
  • NL*
  • Regular positive negative inference (RPNI)
  • DeLeTe2
  • Biermann & Feldman's algorithm
  • Biermann & Feldman's algorithm (using SAT-solving)

Source for the above is libalf, an open-source automata learning algorithm framework in C++; descriptions of at least some of these algorithms can be found in this textbook, among others. There are also implementations of grammatical inference algorithms (including DFA learning) in gitoolbox for MATLAB.
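For a feel of what the state-merging learners in this family work with, here is a minimal Python sketch of a prefix tree acceptor (PTA), the automaton that RPNI-style algorithms typically build from the positive samples before they start merging states. It is only the starting point, not a full learner, and it does not use libalf or gitoolbox; the function names and sample strings are made up for illustration.

```python
def build_pta(samples):
    """Build a prefix tree acceptor from positive sample strings.

    Returns (transitions, accepting): transitions maps
    (state, symbol) -> state, state 0 is the root, and accepting
    is the set of states where some sample ends.
    """
    transitions = {}
    accepting = set()
    next_state = 1
    for word in samples:
        state = 0
        for ch in word:
            key = (state, ch)
            if key not in transitions:
                # First time this prefix is seen: allocate a new state.
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
        accepting.add(state)
    return transitions, accepting


def accepts(transitions, accepting, word):
    """Run the automaton on word; reject on any missing transition."""
    state = 0
    for ch in word:
        state = transitions.get((state, ch))
        if state is None:
            return False
    return state in accepting


# Hypothetical positive samples, for illustration only.
samples = ["ab", "abb", "b"]
pta_transitions, pta_accepting = build_pta(samples)
print(accepts(pta_transitions, pta_accepting, "abb"))
print(accepts(pta_transitions, pta_accepting, "abbb"))
```

The PTA accepts exactly the samples it was built from, so it does not generalize at all; the generalization in a learner like RPNI comes from merging PTA states while staying consistent with the negative samples, which is the part this sketch leaves out.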

Since this question has come up before and has not been satisfactorily answered in the past, I am in the process of evaluating these algorithms and will update with more information about how useful they are, unless someone with more expertise in the area does so first (which is preferable).

NOTE: I am accepting my own answer for now but will gladly accept a better one if someone can provide one.

FURTHER NOTE: I've decided to go with the route of using custom code, since using a generic algorithm turns out to be a bit overkill for the data I'm working with. I'm leaving this answer here in case someone else needs it, and will update if I ever do evaluate these.

Stephen Lin answered Oct 05 '22 22:10