Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

creating a regular expression for a list of strings

Tags:

regex

I have extracted a series of tables from the scientific literature which consist of columns each of which is a distinct type. Here is an exampletable of values

I'd like to be able to automatically generate regular expressions for each column. Obviously there are trivial solutions such as .* so I would add the constraints that they use only:

  • [A-Z] [a-z] [0-9]
  • explicit punctuation (e.g. ',',''')
  • "simple" quantifiers (e.g {3,4}

A "best" answer for the table above would be:

 [A-Z]{3}
 [A-Za-z\s\.]+
 \d{4}\sm
 \d{2}\u00b0\d{2}'\d{2}"N,\d{2}\u00b0\d{2}'\d{2}"E
 (speciosissima|intermediate|troglodytes)
 (hf|sr)
 \d{4}

Of course the 4th regex would break if we move outside the geographical area but the software doesn't know that. The aim would be to collect many regexes for , say "Coordinates" and generalize them, probably partially manual. The enums would only be created if there were a small number of distinct strings.

I'd be grateful for examples of (especially F/OSS) software that can do this, especially in Java. (It's similar to Google's Refine). I am aware of this question 4 years ago but that didn't really answer the question and the text2re site which appears to be interactive.

NOTE: I note a vote to close as "too localised". This is a very general problem (the table given is only an example) as shown by Google/Freebase developing Refine to tackle the problem. It potentially refers to a very wide variety of tables (e.g. financial, journalism, etc.). Here's one with floating point values: enter image description here

It would be useful to determine automatically that some authorities report ages in real numbers (e.g. not months, days) and use 2 digits of precision.

like image 709
peter.murray.rust Avatar asked Oct 05 '22 04:10

peter.murray.rust


2 Answers

Your particular issue is a special case of "programming by demonstration". That is, given a bunch of input/output examples, you want to generate a program. For you, the inputs are strings and the output is whether each string belongs to the given column. In the end, you want to generate a program in the language of limited regular expressions that you proposed.

This particular instance of programming by demonstration seems closely related to Flash Fill, a recent project from MSR. There, instead of generating regular expressions to match data, they automatically generated programs to transform string data based on input/output examples.

I only skimmed through one of their papers, but I'll try to lay out what I understand here.

There are basically two important insights in this paper. The first was to design a small programming language to represent string transformations. Even using full-on regular expressions created too many possibilities to search through quickly. They designed their own abstract language for manipulating strings; however, your constraints (e.g. only using simple quantifiers) would probably play the same role as their custom language. This is largely possible because your particular problem has a somewhat smaller scope than theirs.

The second insight was on how to actually find programs in this abstract language that match with given input/output pairs. My understanding is that the key idea here is to use a technique called version space algebra. The rough idea about version space algebra is that you maintain a representation of the space of possible programs and repeatedly prune it by introducing additional constraints. The exact details of this process fall well outside my main interests, so you're better off reading something like this introduction to version space algebra, which includes some sample code as well.

They also have some clever approaches to rank different candidate programs and even guess which inputs might be problematic for an already-generated program. I saw a demo where they generated a program without giving it enough input/output pairs, and the program could actually highlight new inputs that were likely to be incorrect. This sort of ranking is very interesting, but requires some more sophisticated machine learning techniques and is probably not immediately applicable to your use case. Might still be interesting though. (Also, this might have been detailed in a different paper than the one I linked.)

So yeah, long story short, you can generate your expressions by feeding input/output examples into a system based on version space algebra. I hope that helps.

like image 161
Tikhon Jelvis Avatar answered Oct 13 '22 12:10

Tikhon Jelvis


I'm currently researching the same (or something similar) (here). In general, this is called Grammar induction, or in case of regular expressions, it is induction of regular languages. There is the StaMinA competition about this field. Common algorithms are RPNI and Blue-Fringe.

Here is another related question. And here another one. And here another one.

like image 30
Albert Avatar answered Oct 13 '22 12:10

Albert