Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to auto generate regex from given list of strings? [closed]

You are given 2 lists of Strings - A and B. Find the shortest regex that matches all strings in A and none in B. Note that this regex can match/not-match other strings that are not in A and not in B. For simplicity, we can assume the that our alphabet size is just 2 characters - 0 and 1. Also only these operators are allowed:

* - 0 or more
? - 0 or 1
+ - 1 or more
() - brackets

For simplicity the regex not operator is not allowed. I don't know if allowing the or operator (|) would simplify the problem or not. A and B ofcourse would have no common elements. Here are some examples:

A=[00,01,10] B=[11] answer = 1*0+1* 


A=[00,01,11] B=[10] answer = 0*1* 
like image 767
pathikrit Avatar asked Feb 02 '11 22:02

pathikrit


People also ask

What method regex returns a list of strings containing all matches?

Regex's findall() function is extremely useful as it returns a list of strings containing all matches.

How do you create expressions in regex?

If you want to match for the actual '+', '. ' etc characters, add a backslash( \ ) before that character. This will tell the computer to treat the following character as a search character and consider it for matching pattern. Example : \d+[\+-x\*]\d+ will match patterns like "2+2" and "3*9" in "(2+2) * 3*9".

What is regex generator?

A regular expression (regex, regexp or rational expression) is a sequence of characters that specifies a search pattern. Usually these patterns are used by string-searching algorithms for find or find and replace operations on strings, or for input validation.

What does \f mean in regex?

\f stands for form feed, which is a special character used to instruct the printer to start a new page.


1 Answers

One way to solve this is with a genetic algorithm. I happen to have a genetic solver laying around so I applied it to your problem with the following algorithm:

  • get the distinct tokens from the desired inputs as genes
  • add the Regex specials to the genes
  • for the fitness algorithm
    • make sure the generated string is a valid regex
    • get a fitness value based on how many desired things it matches and how many undesired things it matches
  • until a successful Regex is found
    • starting at the number of distinct tokens and incrementing as necessary
    • try to generate a Regex of that length that passes the fitness requirement

Here's my implementation in C#

private static void GenerateRegex(IEnumerable<string> target, IEnumerable<string> dontMatch) {     string distinctSymbols = new String(target.SelectMany(x => x).Distinct().ToArray());     string genes = distinctSymbols + "?*()+";      Func<string, uint> calcFitness = str =>         {             if (str.Count(x => x == '(') != str.Count(x => x == ')'))             {                 return Int32.MaxValue;             }             if ("?*+".Any(x => str[0] == x))             {                 return Int32.MaxValue;             }             if ("?*+?*+".ToArray().Permute(2)                 .Any(permutation => str.IndexOf(new string(permutation.ToArray())) != -1))             {                 return Int32.MaxValue;             }             Regex regex;             try             {                 regex = new Regex("^" + str + "$");             }             catch (Exception)             {                 return Int32.MaxValue;             }             uint fitness = target.Aggregate<string, uint>(0, (current, t) => current + (regex.IsMatch(t) ? 0U : 1));             uint nonFitness = dontMatch.Aggregate<string, uint>(0, (current, t) => current + (regex.IsMatch(t) ? 10U : 0));             return fitness + nonFitness;         };      for (int targetGeneLength = distinctSymbols.Length; targetGeneLength < genes.Length * 2; targetGeneLength++)     {         string best = new GeneticSolver(50).GetBestGenetically(targetGeneLength, genes, calcFitness, true);         if (calcFitness(best) != 0)         {             Console.WriteLine("-- not solved with regex of length " + targetGeneLength);             continue;         }         Console.WriteLine("solved with: " + best);         break;     } } 

And the result of its application to your samples:

public void Given_Sample_A() {     var target = new[] { "00", "01", "10" };     var dontMatch = new[] { "11" };      GenerateRegex(target, dontMatch); } 

output:

Generation  1 best: 10 (2) Generation  2 best: 0+ (2) Generation  5 best: 0* (2) Generation  8 best: 00 (2) Generation  9 best: 01 (2) -- not solved with regex of length 2 Generation  1 best: 10* (2) Generation  3 best: 00* (2) Generation  4 best: 01+ (2) Generation  6 best: 10+ (2) Generation  9 best: 00? (2) Generation 11 best: 00+ (2) Generation 14 best: 0?1 (2) Generation 21 best: 0*0 (2) Generation 37 best: 1?0 (2) Generation 43 best: 10? (2) Generation 68 best: 01* (2) Generation 78 best: 1*0 (2) Generation 79 best: 0*1 (2) Generation 84 best: 0?0 (2) Generation 127 best: 01? (2) Generation 142 best: 0+1 (2) Generation 146 best: 0+0 (2) Generation 171 best: 1+0 (2) -- not solved with regex of length 3 Generation  1 best: 1*0+ (1) Generation  2 best: 0+1* (1) Generation 20 best: 1?0+ (1) Generation 31 best: 1?0* (1) -- not solved with regex of length 4 Generation  1 best: 1*00? (1) Generation  2 best: 0*1?0 (1) Generation  3 best: 1?0?0 (1) Generation  4 best: 1?00? (1) Generation  8 best: 1?00* (1) Generation 12 best: 1*0?0 (1) Generation 13 best: 1*00* (1) Generation 41 best: 0*10* (1) Generation 44 best: 1*0*0 (1) -- not solved with regex of length 5 Generation  1 best: 0+(1)? (1) Generation 36 best: 0+()1? (1) Generation 39 best: 0+(1?) (1) Generation 61 best: 1*0+1? (0) solved with: 1*0+1? 

second sample:

public void Given_Sample_B() {     var target = new[] { "00", "01", "11" };     var dontMatch = new[] { "10" };      GenerateRegex(target, dontMatch); } 

output:

Generation  1 best: 00 (2) Generation  2 best: 01 (2) Generation  7 best: 0* (2) Generation 12 best: 0+ (2) Generation 33 best: 1+ (2) Generation 36 best: 1* (2) Generation 53 best: 11 (2) -- not solved with regex of length 2 Generation  1 best: 00* (2) Generation  2 best: 0+0 (2) Generation  7 best: 0+1 (2) Generation 12 best: 00? (2) Generation 15 best: 01* (2) Generation 16 best: 0*0 (2) Generation 19 best: 01+ (2) Generation 30 best: 0?0 (2) Generation 32 best: 0*1 (2) Generation 42 best: 11* (2) Generation 43 best: 1+1 (2) Generation 44 best: 00+ (2) Generation 87 best: 01? (2) Generation 96 best: 0?1 (2) Generation 125 best: 11? (2) Generation 126 best: 1?1 (2) Generation 135 best: 11+ (2) Generation 149 best: 1*1 (2) -- not solved with regex of length 3 Generation  1 best: 0*1* (0) solved with: 0*1* 
like image 167
Handcraftsman Avatar answered Oct 11 '22 09:10

Handcraftsman