I'm trying to create a regex in which we can check if all letters present in some reference set are present in some other string, but only in odd numbers (1, 3, 5, ...).
Here is a (very) crude image of the DFA representing the problem:
I started using a finite set, {a, b}
, so I would essentially check "are there both an odd number of a
s and an odd number of b
s in the string?"
Unfortunately I did not get far on my own. I first read this thread, which is remarkably similar to this concept, but was not able to glean an answer from (aa|bb|(ab|ba)(aa|bb)*(ba|ab))*(b|(ab|ba)(bb|aa)*a)
. (I understand how it works, but not how to convert it to check odd numbers for both items present.)
Here is what I've come up with so far: ^((ab|ba)(bb|aa)?|(bb|aa)?(ab|ba))+$
. This basically checks if there is ab
or ba
followed by bb
or aa
or nothing, which would result in ab
, ba
, abaa
, abbb
, baaa
, or babb
. (It also does the reverse of this, checking the double-letter first.) This can then repeat, indefinitely. The problem I have is I cannot seem to adjust it to match the string bbaaba
without also matching bbaa
.
Additionally, the method above can not be dynamically adjusted to account for {a, b, c}
, for example, though I'm willing to forgo this to solve the initial problem.
Here are my test strings and the desired output, with the reasons in parentheses:
"ba" # True (1a, 1b)
"abbb" # True (1a, 3b)
"bbba" # True (1a, 3b)
"bbab" # True (1a, 3b)
"ababab" # True (3a, 3b)
"bbaaba" # True (3a, 3b)
"abb" # False (2b)
"aabb" # False (2a, 2b)
"aabba" # False (2b)
"" # False (0a, 0b is "even")
"a" # False (0b is "even")
"b" # False (0a is "even")
So, is this possible through regex? Or are regular expressions more limited than a DFA? I am aware that it can be done through a basic loop, but this isn't what I'm going for.
Solution : The regular expression for odd number of a is b*ab*(ab*ab*)* and corresponding automata is given in Figure 6 and minimum number of states are 2.
Regular expressions are useful in any scenario that benefits from full or partial pattern matches on strings. These are some common use cases: verify the structure of strings. extract substrings form structured strings.
The regex [0-9] matches single-digit numbers 0 to 9. [1-9][0-9] matches double-digit numbers 10 to 99. That's the easy part. Matching the three-digit numbers is a little more complicated, since we need to exclude numbers 256 through 999.
Each character in a regular expression (that is, each character in the string describing its pattern) is either a metacharacter, having a special meaning, or a regular character that has a literal meaning. For example, in the regex b. , 'b' is a literal character that matches just 'b', while '.
Regexes are not more limited than a DFA; in fact, they are equivalent. (Perl-style "regexes" with backreferences are strictly more powerful, so they are not "regular" at all.)
We can easily write the regex if the string contains only a
s:
a(aa)*
And if other letters may also occur in between, we can still do it by simply ignoring those characters:
[^a]*a([^a]*a[^a]*a)*[^a]*
Because regexes are equivalent to DFA's, we have a DFA for each of the individual letters. It's pretty simple, actually:
[^a] _ [^a] _
/ \ / \
| v a | v
---> (0) -----> ((1))
<-----
a
State (0) is the start state ("even number of a
's seen") and state ((1)) is the only accepting state ("odd number of a
s seen"). If we see an a
, we go to the other state; for any other character, we remain in the same state.
Now the nice thing about DFAs is that they are composable. In particular, they are closed under intersection. This means that, if we have a DFA that recognizes the language "string containing an odd number of a
s", and another one that recognizes the language "string containing an odd number of b
s", we can combine them into a DFA that recognizes the intersection of these two languages, that is, "string containing an odd number of a
's and an odd number of b
's".
I won't go into detail about the algorithm but this question has some pretty good answers. The resulting DFA will have four states: "even number of a
s seen, even number of b
s seen", "even number of a
s seen, odd number of b
s seen", etcetera.
And because DFAs are equivalent to regexes, there also exists a regex that matches precisely those strings. Again, I won't go into details about the algorithm, but here is an article that explains it pretty well. Conveniently, it also comes with some Python 3 code to do the dirty work:
>>> from fsm import fsm
>>> a = fsm(
alphabet = {'a', 'b'},
states = {0, 1, 2, 3},
initial = 0,
finals = {3},
map = {
0: {'a': 1, 'b': 2},
1: {'a': 0, 'b': 3},
2: {'a': 3, 'b': 0},
3: {'a': 2, 'b': 1}
}
)
>>> str(a.lego())
'a*(ab|b(ba*b)*(a|ba+b))((a|ba+b)(ba*b)*(a|ba+b)|ba*b)*'
There might be a bug in the library, or I'm using it wrong, because the a*
at the start cannot possibly be right. But you get the idea: although it's theoretically possible, you really don't want to use regexes for this!
Here's one way to do it, using lookaheads to assert each condition in turn.
^(?=[^a]*a(?:[^a]*a[^a]*a)*[^a]*$)(?=[^b]*b(?:[^b]*b[^b]*b)*[^b]*$)(.*)$
Here's a demo with your examples. (The \n
s in the demo are for presentation purposes. Also, you can drop the (.*)$
if you only need to test the match, not capture.)
I will be adding an explanation shortly.
Explanation
We only need to look at one half:
(?= [^a]*a (?:[^a]*a[^a]*a) * [^a]*$ )
| | | | |
| | | | Only accept non-'a's to the end.
| | | |
| | | Zero or more of these pairs of 'a's.
| | |
| | Strictly a pair of 'a's.
| |
| Find the first 'a'.
|
Use a lookahead to assert multiple conditions.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With