Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

String generation with regex like criteria

I wonder wether it is feasible to implement an optimal string generator Class meeting the following second thought requirements:


  • Generation criteria using regex
  • Lexicographical order enumeration.
  • Count propetry
  • Indexed access

I don't feel comfortable with regular expression: I cannot come up with a starting piece of code but I just think of a naive implementation using a TList as a base class and use a filter (Regex) against "brute force" generated string.

What are the other optimal alternatives ?


  • Ordering: First by length (shortest first), then lexicographically.
  • Specification of the range of characters to be used in the generation: All printable or any possible combination of [A-Z], [a-z], numbers, special symbols, and eventually space (regex ?).
  • String Length bounded with a given Min/Max.
  • Space of search constrained with bounds: Start string an End string with possibility of filtering (regex ?)

Last Edit

To begin with, I rephrased the header using regex like instead of regex.

I am considering to revise the first requirement as it is an open door which may lead to untractable issue.

I need suggestions and help for the correct wording.

Second thought requirements edit done. Still open to suggestion for refinement.

like image 928
menjaraz Avatar asked Apr 05 '12 17:04

menjaraz


2 Answers

Old question, but no one has answered it, the bounty is still active and I already have a solution ready, so here is a possible answer:

I once wrote a little program that does this. It is however in C++/Qt (although I write almost all my programs in Delphi, that one is in C++), doesn't support Unicode and makes no order guarantees

It works as follow:

  1. All ? {} + * | () operators are expanded (to a maximal limit), so that only character classes and backreferences remain.

    e.g. [a-c]+|t*|([x-z]){2}foo\1|(a|b)(t|u) becomes [a-c]|[a-c][a-c]|[a-c][a-c][a-c]|[a-c][a-c][a-c][a-c]||t|tt|tt|ttt|ttt|([x-z][x-z])foo\1|at|au|bt|bu

    (the | in latter expression are just notation, the program keeps each alternative subregex in a list)

  2. Backreferences to multiple characters are replaced by backreferences to single characters.

    e.g. the expression above becomes [a-c]|[a-c][a-c]|[a-c][a-c][a-c]|[a-c][a-c][a-c][a-c]||t|tt|tt|ttt|ttt|([x-z])([x-z])foo\1\2|at|au|bt|bu

    Now each alternative subregex matches a fixed length string.

  3. For each of the alternatives, all combinations of picking characters from the classes are printed:

    e.g. the expression above becomes a|b|c|aa|ba|..|cc|aaa|baa|...|ccc|aaaa|...|cccc||t|tt|tt|ttt|ttt|xxfooxx|yxfooyx|...|zzfoozz|at|au|bt|bu

You probably can add shortlex order guarantees as follow:

  1. Sort the characters in the classes alphabetically

  2. Sort the alternatives obtained in step 2. above for length

    (there are exponentially many alternatives, but usually their count is negligible compared to the count of resulting strings )

  3. Sort/exchange the character classes and backreferences, so that every reference points backward

  4. Enumerate the possible strings for a single fixed-length alternative as before, but start at the last character class, instead at the first to get an alphabetically ordering.

    (this wouldn't work, if there were any backreferences pointing forward)

  5. If there are several alternatives of the same length, enumerate them in "parallel", compare their current strings and print the alphabetically lowest. (i.e. merge the already sorted lists for each alternative.)

    This can be optimized, e.g. by detecting distinct prefixes and safe character classes in the suffix which can be enumerated without affecting the ordering. (e.g. a[a-z]|b[a-z] has distinct prefixes and the [a-z] can be enumerated without any comparisons)

like image 179
BeniBela Avatar answered Sep 16 '22 12:09

BeniBela


I'd do this by constructing the minimum Deterministic Finite Automaton for the language. If you are starting with a regex, this can be done automatically by Thompson's Construction followed by the Subset Construction and minimization. See this description for example.

With a DFA in hand, you can use something like this algorithm:

Let P = { < START, [""] > } be a set of pairs <State, list of strings>
for n = 0, 1, ... Max
  Let P' = {} be a new set 
  while P is not empty 
    Remove the pair <s, L> from P 
    For each transition s -- c --> t in alpha order of c
      if t is an accepting state,
          output l + c for each string l in L
      put <t, L + c> in P' (** i.e. append c to each string in L)
  end
  Set P = P'
end

Note that the step marked ** needs to be true set insertion, as duplicates can easily crop up.

This is a core algorithm. P can grow exponentially with output length, but this is just the price of tracking all possibilities for a future output string. The order/size/space constraints you mentioned can be ensured by maintaining sorted order in the lists L and by cutting off the search when resource limits are reached.

Edit

Here is a toy Java example where I've hard coded the DFA for simple binary floating point literals with optional minus sign. This uses a slightly different scheme than the pseudocode above to get strict sorted order of output and to accomodate character ranges.

import java.util.Comparator;
import java.util.TreeSet;

public class Test{

    public static class DFA {

        public static class Transition  {

            final int to;
            final char lo, hi; // Character range.

            public Transition(int to, char lo, char hi) {
                this.to = to;
                this.lo = lo;
                this.hi = hi;
            }

            public Transition(int to, char ch) {
                this(to, ch, ch);
            }
        }

        // transitions[i] is a vector of transitions from state i.
        final Transition [] [] transitions;

        // accepting[i] is true iff state i is accepting
        final boolean [] accepting;

        // Make a fresh immutable DFA.
        public DFA(Transition [] [] transitions, boolean [] accepting) {
            this.transitions = transitions;
            this.accepting = accepting;
        }

        // A pair is a DFA state number and the input string read to get there.
        private static class Pair {
            final int at;
            final String s;

            Pair(int at, String s) {
                this.at = at;
                this.s = s;
            }
        }

        // Compare pairs ignoring `at` states, since 
        // they are equal iff the strings are equal.
        private Comparator<Pair> emitOrder = new Comparator<Pair>() {
            @Override
            public int compare(Pair a, Pair b) {
                return a.s.compareTo(b.s);
            }
        };

        // Emit all strings accepted by the DFA of given max length.
        // Output is in sorted order.
        void emit(int maxLength) {
            TreeSet<Pair> pairs = new TreeSet<Pair>(emitOrder);
            pairs.add(new Pair(0, ""));
            for (int len = 0; len <= maxLength; ++len) {
                TreeSet<Pair> newPairs = new TreeSet<Pair>(emitOrder);
                while (!pairs.isEmpty()) {
                    Pair pair = pairs.pollFirst();
                    for (Transition x : transitions[pair.at]) {
                        for (char ch = x.lo; ch <= x.hi; ch++) {
                            String s = pair.s + ch;
                            if (newPairs.add(new Pair(x.to, s)) && accepting[x.to]) {
                                System.out.println(s);
                            }
                        }
                    }
                }
                pairs = newPairs;
            }
        }
    }

    // Emit with a little DFA for floating point numbers.
    public void run() {
        DFA.Transition [] [] transitions = {
            {   // From 0
                new DFA.Transition(1, '-'),
                new DFA.Transition(2, '.'),
                new DFA.Transition(3, '0', '1'),
            },
            {   // From 1
                new DFA.Transition(2, '.'),
                new DFA.Transition(3, '0', '1'),
            },
            {   // From 2
                new DFA.Transition(4, '0', '1'),
            },
            {   // From 3
                new DFA.Transition(3, '0', '1'),
                new DFA.Transition(4, '.'),
            },
            {   // From 4
                new DFA.Transition(4, '0', '1'),
            }  
        };
        boolean [] accepting = { false, false, false, true, true };
        new DFA(transitions, accepting).emit(4);
    }

    public static void main (String [] args) {
        new Test().run();
    }
}
like image 27
Gene Avatar answered Sep 20 '22 12:09

Gene