I'm developing a C++ application to first parse regular expression strings and then perform some calculations with it. Are there any existing algorithms that can that output the number N of strings of length L that can be recognized by a given regex such as (a|ab)* | (aa|bb)*
? Or is there a mathematical formula I can use such as one involving factorials? I just want to get the number N of strings that can be recognized by such regex phrases for a given number L. Example for (a|ab)*
how many strings of length 5 (L) can be recognized by the regex. I think the answer would be 5. But for high numbers of L I was wondering if there are any algorithms or math expressions that can calculate that.
[] denotes a character class. () denotes a capturing group. [a-z0-9] -- One character that is in the range of a-z OR 0-9. (a-z0-9) -- Explicit capture of a-z0-9 .
Most library implementations of regular expressions use a backtracking algorithm that can take an exponential amount of time on some inputs.
$ means "Match the end of the string" (the position after the last character in the string). Both are called anchors and ensure that the entire string is matched instead of just a substring.
The character + in a regular expression means "match the preceding character one or more times". For example A+ matches one or more of character A. The plus character, used in a regular expression, is called a Kleene plus .
Here is an efficient algorithm based on matrix exponentiation that you can use to compute these numbers.
I'm only going to give a high level description, not code.
First, you want to use a well-known equivalence from foundations of computer science, that a (simple) regular expression is equivalent to a finite state machine.
(Recall that a finite state machine, is essentially a flow chart, in which from each node, there is a labeled edge for each letter in your alphabet to some particular other node (or maybe its a loop). Additionally, some subset of the states is called the "Accept Set", and some particular state in the flow chart is the start state. A string is said to induce a path in the finite state machine, by starting in the start state and following the labelled edges in succession. The machine accepts
a string if the final state is in the accept set, and rejects
a string otherwise.
A classical construction shows that from any regular expression we can construct a finite state machine of similar size, and from any finite state machine we can construct a regular expression of similar size. Any language (subset of all finite strings) which corresponds to a regular expression is called "regular" and a language is regular if and only if it corresponds to a finite state machine.)
For example, if I have an alphabet {a,b,c}
, and the regular expression (a|b)*
, it corresponds to a machine with two states. The start state has a loop labelled a
, a loop labelled b
, and an arrow labelled c
to the second state. The second state has three loops to itself, so you get trapped if you go there. The accept set contains only the start state.
The first step of your program is to convert a regular expression to a corresponding finite state machine. (It might be that some existing regex library already does this basically, I think that PCRE might, although I'm not sure.)
Given a finite state machine, I want to construct a corresponding stochastic matrix. In this matrix, we have a row for each state, and a column for each state, and each entry is a probability. The probability p_{i,j}
at entry (i,j)
, is equal to the probability that if I'm at state i
, and I read a random letter, I go to state j
next. So for the example I gave, the matrix is
[ 2/3 1/3 ]
[ 0 1 ]
If you want to know about strings of length k
, then using matrix exponentiation, compute the matrix M^k
where M
is the probability transition matrix above.
Finally, if q
is the start state, add up all of the entries M^k_{q, s}
for each state s
in the accept set. The sum of these probabilities is exactly equal to the probability that a random string of length k
is accepted by the regular expression. So, you can get the number of such strings by multipling by N^k
where N
is the number of letters in your alphabet.
I think the existence of this algorithm is not hard, but it's not trivial either, I once gave a harder version of this as an extra credit problem in a final exam in a theory of computation class. I don't know of any existing implementation of it, would be interested to know.
There is some significant speed-up that you get doing it this way over the naive method, when you use matrix exponentiation. That allows you to do it for large k
quickly.
I don't know if there's a more efficient, approximate solution, it would be interesting. I guess that random sampling will always give you something but maybe there's some kind of spectral algorithm, based on doing singular value decomposition of the matrix M
or something.
Note: If you actually want to implement it, I guess that you should not use floating point numbers, the matrix M
should actually be a matrix of integers. Basically you would multiply it by N
where N
is the number of letters in your alphabet. And you would skip multiplying the sum by N^k
later. I think it's easier to understand using probabilities, but in that version, M^k_{i,j}
would just be counting number of paths from i to j of length k
.
Note: As pointed out in the comments, this algorithm is polynomial in the number of bits of k
for any fixed regular expression, so it's good even for large k
. It's exponential in the worst case though in the size of the regular expression though -- to handle large and complicated regular expressions, you should use some kind of DFA minimization I guess if you want to use this method. For the simple regular expressions shown in the question, I think it should be fine.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With