Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I expand a finite pattern into all its possible matches?

Tags:

regex

For example, given the pattern

[a-zA-Z]_[0-9]{2}

a function would take the pattern and return an array or list containing

a_00, a_01, a_02, ... , a_98, a_99, ... , z_98, z_99

Only numbers and letters (and finite groupings thereof) need be expanded. How would I go about doing this? An example in Python or Perl would be preferred. Thank you!

like image 478
Ignis Umbrae Avatar asked Aug 08 '09 10:08

Ignis Umbrae


1 Answers

First, parse the expression and build a tree reflecting the syntactic structure of the regular expression, and include a terminator node that logically appears at the end. For example, in lisp notation, your example might look like this:

(concat
  (repeat 2
    (concat
      (charset
        (range a z)
        (range A Z))
      (literal _)
      (charset
        (range 0 9))))
  terminator)

Next, thread the tree so that you can use recursion to generate the combinatorial expansion. By that, I mean e.g. that nodes a..z in (concat a .. z) all need to have pointers from one to the next, so a points to b, and so on, and the concat node itself points to its successor. The idea is that you can produce one version of the current element in the expansion, and recurse into the next element, and when the next element returns you can try the next version of the current element, etc., until all versions are exhausted and you return to your caller (the predecessor or parent). This threading is easiest done with a stack and a post-order traversal of the tree. If you carefully push nodes during post-order traversal, the top of the stack will be the next element in sequence.

(An alternative to threading is to structure the tree so that the next element in every concat node is a child of the previous node, and repeat nodes' children point back to the repeat node.)

Then write a routine (or set of routines using pattern matching, or virtual methods if nodes in the tree are implemented using polymorphism in an object-oriented language) that, for any given node type, produces the correct output and recurses into the next node or children in the appropriate way. For example, in pseudocode:

if node is of form (repeat n): # non-variable repeat
    for i in 1 to n
        recurse into child
    recurse into threaded successor

if node is of form (concat ...):
    recurse into first element # last element will recurse into successor

if node is of form (literal x):
    emit x
    recurse into successor
    remove x

if node is of form (charset ...):
    for each element in charset:
        emit element
        recurse into successor
        remove element

if node is terminator:
    add string created thus far to final output list

etc. As you can observe, children of a repeat node mustn't recurse into the successor of the repeat node, so that needs to be taken into account when threading the tree. Similar care needs to be taken that "current progress" isn't lost when reaching the end of child nodes of a repeat node; alternatively, the successor of child nodes could point to the repeat node itself (i.e. a true closure over the graph of nodes), but that will require storing a counter somewhere.

All in all, completing this could take several days, depending on how flexible and performant it needs to be. Also note that certain constructs, such as Kleene star or closure (* or + in extended regex) will result in an infinite list. A language that supports generators (e.g. C#'s iterator syntax) or coroutines / continuations (e.g. Scheme's call/cc) or lazy evaluation (e.g. Haskell) can permit iteration over the first few elements of the list without having to evaluate the entire list. Alternatively, choosing random potential output rather than exhaustive iteration may be preferable for providing examples corresponding to a regular expression.

like image 145
Barry Kelly Avatar answered Sep 29 '22 12:09

Barry Kelly