Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can we use regular expressions to check if there are an odd number of each type of character?

Tags:

python

regex

The problem

I'm trying to create a regex in which we can check if all letters present in some reference set are present in some other string, but only in odd numbers (1, 3, 5, ...).

Here is a (very) crude image of the DFA representing the problem:

Odd As and Bs DFA

My (broken) solution

I started using a finite set, {a, b}, so I would essentially check "are there both an odd number of as and an odd number of bs in the string?"

Unfortunately I did not get far on my own. I first read this thread, which is remarkably similar to this concept, but was not able to glean an answer from (aa|bb|(ab|ba)(aa|bb)*(ba|ab))*(b|(ab|ba)(bb|aa)*a). (I understand how it works, but not how to convert it to check odd numbers for both items present.)

Here is what I've come up with so far: ^((ab|ba)(bb|aa)?|(bb|aa)?(ab|ba))+$. This basically checks if there is ab or ba followed by bb or aa or nothing, which would result in ab, ba, abaa, abbb, baaa, or babb. (It also does the reverse of this, checking the double-letter first.) This can then repeat, indefinitely. The problem I have is I cannot seem to adjust it to match the string bbaaba without also matching bbaa.

Additionally, the method above can not be dynamically adjusted to account for {a, b, c}, for example, though I'm willing to forgo this to solve the initial problem.

Testing

Here are my test strings and the desired output, with the reasons in parentheses:

"ba"      # True (1a, 1b)
"abbb"    # True (1a, 3b)
"bbba"    # True (1a, 3b)
"bbab"    # True (1a, 3b)
"ababab"  # True (3a, 3b)
"bbaaba"  # True (3a, 3b)
"abb"     # False (2b)
"aabb"    # False (2a, 2b)
"aabba"   # False (2b)
""        # False (0a, 0b is "even")
"a"       # False (0b is "even")
"b"       # False (0a is "even")

Question

So, is this possible through regex? Or are regular expressions more limited than a DFA? I am aware that it can be done through a basic loop, but this isn't what I'm going for.

like image 835
Cat Avatar asked Sep 14 '12 20:09

Cat


People also ask

Which of the following regular expressions gives only odd number of A?

Solution : The regular expression for odd number of a is b*ab*(ab*ab*)* and corresponding automata is given in Figure 6 and minimum number of states are 2.

Which are 3 uses of regular expression?

Regular expressions are useful in any scenario that benefits from full or partial pattern matches on strings. These are some common use cases: verify the structure of strings. extract substrings form structured strings.

Can regex be used for numbers?

The regex [0-9] matches single-digit numbers 0 to 9. [1-9][0-9] matches double-digit numbers 10 to 99. That's the easy part. Matching the three-digit numbers is a little more complicated, since we need to exclude numbers 256 through 999.

What are the two types of characters used in regular expression?

Each character in a regular expression (that is, each character in the string describing its pattern) is either a metacharacter, having a special meaning, or a regular character that has a literal meaning. For example, in the regex b. , 'b' is a literal character that matches just 'b', while '.


2 Answers

Regexes are not more limited than a DFA; in fact, they are equivalent. (Perl-style "regexes" with backreferences are strictly more powerful, so they are not "regular" at all.)

We can easily write the regex if the string contains only as:

a(aa)*

And if other letters may also occur in between, we can still do it by simply ignoring those characters:

[^a]*a([^a]*a[^a]*a)*[^a]*

Because regexes are equivalent to DFA's, we have a DFA for each of the individual letters. It's pretty simple, actually:

 [^a] _      [^a] _
     / \         / \
     | v   a     | v
---> (0) -----> ((1))
         <-----
            a

State (0) is the start state ("even number of a's seen") and state ((1)) is the only accepting state ("odd number of as seen"). If we see an a, we go to the other state; for any other character, we remain in the same state.

Now the nice thing about DFAs is that they are composable. In particular, they are closed under intersection. This means that, if we have a DFA that recognizes the language "string containing an odd number of as", and another one that recognizes the language "string containing an odd number of bs", we can combine them into a DFA that recognizes the intersection of these two languages, that is, "string containing an odd number of a's and an odd number of b's".

I won't go into detail about the algorithm but this question has some pretty good answers. The resulting DFA will have four states: "even number of as seen, even number of bs seen", "even number of as seen, odd number of bs seen", etcetera.

And because DFAs are equivalent to regexes, there also exists a regex that matches precisely those strings. Again, I won't go into details about the algorithm, but here is an article that explains it pretty well. Conveniently, it also comes with some Python 3 code to do the dirty work:

>>> from fsm import fsm
>>> a = fsm(
      alphabet = {'a', 'b'},
      states = {0, 1, 2, 3},
      initial = 0,
      finals = {3},
      map = {
        0: {'a': 1, 'b': 2},
        1: {'a': 0, 'b': 3},
        2: {'a': 3, 'b': 0},
        3: {'a': 2, 'b': 1}
      }
    )
>>> str(a.lego())
'a*(ab|b(ba*b)*(a|ba+b))((a|ba+b)(ba*b)*(a|ba+b)|ba*b)*'

There might be a bug in the library, or I'm using it wrong, because the a* at the start cannot possibly be right. But you get the idea: although it's theoretically possible, you really don't want to use regexes for this!

like image 186
Thomas Avatar answered Oct 29 '22 15:10

Thomas


Here's one way to do it, using lookaheads to assert each condition in turn.

^(?=[^a]*a(?:[^a]*a[^a]*a)*[^a]*$)(?=[^b]*b(?:[^b]*b[^b]*b)*[^b]*$)(.*)$

Here's a demo with your examples. (The \ns in the demo are for presentation purposes. Also, you can drop the (.*)$ if you only need to test the match, not capture.)

I will be adding an explanation shortly.


Explanation

We only need to look at one half:

(?=  [^a]*a  (?:[^a]*a[^a]*a)  *  [^a]*$  )
|    |       |                 |  |
|    |       |                 |  Only accept non-'a's to the end.
|    |       |                 |
|    |       |                 Zero or more of these pairs of 'a's.
|    |       |
|    |       Strictly a pair of 'a's.
|    |
|    Find the first 'a'.
|
Use a lookahead to assert multiple conditions.
like image 27
slackwing Avatar answered Oct 29 '22 13:10

slackwing