Can we use regular expressions to check if there are an odd number of each type of character?

The problem

I'm trying to write a regex that checks whether every letter in some reference set occurs in another string an odd number of times (1, 3, 5, ...).

Here is a (very) crude image of the DFA representing the problem:

[Image: Odd As and Bs DFA, https://i.stack.imgur.com/AoPH3.png]

My (broken) solution

I started with a small set, {a, b}, so I am essentially checking: "does the string contain both an odd number of a's and an odd number of b's?"

Unfortunately I did not get far on my own. I first read this thread, which is remarkably similar to this concept, but was not able to glean an answer from (aa|bb|(ab|ba)(aa|bb)*(ba|ab))*(b|(ab|ba)(bb|aa)*a). (I understand how it works, but not how to convert it to check odd numbers for both items present.)

Here is what I've come up with so far: ^((ab|ba)(bb|aa)?|(bb|aa)?(ab|ba))+$. This basically checks if there is ab or ba followed by bb or aa or nothing, which would result in ab, ba, abaa, abbb, baaa, or babb. (It also does the reverse of this, checking the double-letter first.) This can then repeat, indefinitely. The problem I have is I cannot seem to adjust it to match the string bbaaba without also matching bbaa.

Additionally, the method above cannot be adjusted dynamically to account for {a, b, c}, for example, though I'm willing to forgo this to solve the initial problem.

Testing

Here are my test strings and the desired output, with the reasons in parentheses:

"ba"      # True (1a, 1b)
"abbb"    # True (1a, 3b)
"bbba"    # True (1a, 3b)
"bbab"    # True (1a, 3b)
"ababab"  # True (3a, 3b)
"bbaaba"  # True (3a, 3b)
"abb"     # False (2b)
"aabb"    # False (2a, 2b)
"aabba"   # False (2b)
""        # False (0a, 0b is "even")
"a"       # False (0b is "even")
"b"       # False (0a is "even")

Question

So, is this possible through regex? Or are regular expressions more limited than a DFA? I am aware that it can be done through a basic loop, but this isn't what I'm going for.

asked Sep 14 '12 20:09

Cat

2 Answers

Regexes are not more limited than a DFA; in fact, they are equivalent. (Perl-style "regexes" with backreferences are strictly more powerful, so they are not "regular" at all.)

We can easily write the regex if the string consists only of a's:

a(aa)*

And if other letters may also occur in between, we can still do it by simply ignoring those characters:

[^a]*a([^a]*a[^a]*a)*[^a]*
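
As a quick sanity check of that pattern, here is a small Python sketch. The names ODD_A and has_odd_a and the sample strings are my own, not part of the answer; re.fullmatch is used so the pattern has to cover the whole string.

```python
import re

# The "odd number of a's" pattern from above; fullmatch anchors it at both ends.
ODD_A = re.compile(r"[^a]*a([^a]*a[^a]*a)*[^a]*")

def has_odd_a(s):
    """Return True if s contains an odd number of 'a' characters."""
    return ODD_A.fullmatch(s) is not None

# Compare the regex against a plain count for a few sample strings.
for s in ["a", "bab", "aaa", "xaxaxa", "", "aa", "baab"]:
    print(repr(s), has_odd_a(s), s.count("a") % 2 == 1)
```

The two printed booleans should agree on every line, since the regex and the count express the same condition.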

Because regexes are equivalent to DFAs, there is a DFA for each of the individual letters. It's pretty simple, actually:

 [^a] _      [^a] _
     / \         / \
     | v   a     | v
---> (0) -----> ((1))
         <-----
            a

State (0) is the start state ("even number of a's seen") and state ((1)) is the only accepting state ("odd number of a's seen"). If we see an a, we go to the other state; for any other character, we remain in the same state.
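
The two-state machine described above can be simulated directly. This is only a sketch to make the transitions concrete; the function name odd_count_dfa is made up for illustration.

```python
def odd_count_dfa(letter, text):
    """Simulate the two-state DFA: state 0 = even count so far, state 1 = odd."""
    state = 0                      # start state: zero occurrences counts as even
    for ch in text:
        if ch == letter:
            state = 1 - state      # seeing the letter flips the state
        # any other character leaves the state unchanged
    return state == 1              # state 1 is the only accepting state

print(odd_count_dfa("a", "bab"))   # True  (one a)
print(odd_count_dfa("a", "baab"))  # False (two a's)
```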

Now the nice thing about DFAs is that they are composable. In particular, they are closed under intersection. This means that, if we have a DFA that recognizes the language "string containing an odd number of a's", and another one that recognizes the language "string containing an odd number of b's", we can combine them into a DFA that recognizes the intersection of these two languages, that is, "string containing an odd number of a's and an odd number of b's".

I won't go into detail about the algorithm but this question has some pretty good answers. The resulting DFA will have four states, one for each combination of parities: "even number of a's seen, even number of b's seen", "even number of a's seen, odd number of b's seen", "odd a's, even b's", and "odd a's, odd b's". Only the last one is accepting.
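
For the curious, the usual way to intersect two DFAs is the product construction: run both machines in lockstep and accept only if both accept. Below is a minimal sketch of that idea, assuming the alphabet is exactly {a, b}. The dictionary layout and the names parity_dfa, intersect and accepts are my own choices, not from any library.

```python
from itertools import product

def parity_dfa(letter, alphabet):
    """Two-state DFA over `alphabet` accepting strings with an odd count of `letter`."""
    delta = {(q, ch): (1 - q if ch == letter else q)
             for q in (0, 1) for ch in alphabet}
    return {"start": 0, "accept": {1}, "delta": delta}

def intersect(d1, d2, alphabet):
    """Product construction: states are pairs, and both components must accept."""
    states1 = {q for q, _ in d1["delta"]}
    states2 = {q for q, _ in d2["delta"]}
    delta = {((p, q), ch): (d1["delta"][p, ch], d2["delta"][q, ch])
             for p, q in product(states1, states2) for ch in alphabet}
    accept = set(product(d1["accept"], d2["accept"]))
    return {"start": (d1["start"], d2["start"]), "accept": accept, "delta": delta}

def accepts(dfa, text):
    """Run the DFA; raises KeyError for characters outside the alphabet."""
    state = dfa["start"]
    for ch in text:
        state = dfa["delta"][state, ch]
    return state in dfa["accept"]

ALPHABET = "ab"
both_odd = intersect(parity_dfa("a", ALPHABET), parity_dfa("b", ALPHABET), ALPHABET)

print(accepts(both_odd, "bbaaba"))  # True  (3 a's, 3 b's)
print(accepts(both_odd, "aabb"))    # False (2 a's, 2 b's)
```

The product has 2 x 2 = 4 states, matching the four parity combinations.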

And because DFAs are equivalent to regexes, there also exists a regex that matches precisely those strings. Again, I won't go into details about the algorithm, but here is an article that explains it pretty well. Conveniently, it also comes with some Python 3 code to do the dirty work:

>>> from fsm import fsm
>>> a = fsm(
      alphabet = {'a', 'b'},
      states = {0, 1, 2, 3},
      initial = 0,
      finals = {3},
      map = {
        0: {'a': 1, 'b': 2},
        1: {'a': 0, 'b': 3},
        2: {'a': 3, 'b': 0},
        3: {'a': 2, 'b': 1}
      }
    )
>>> str(a.lego())
'a*(ab|b(ba*b)*(a|ba+b))((a|ba+b)(ba*b)*(a|ba+b)|ba*b)*'

There might be a bug in the library, or I'm using it wrong, because the a* at the start cannot possibly be right. But you get the idea: although it's theoretically possible, you really don't want to use regexes for this!
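
That suspicion is easy to check without the library. The sketch below compares the generated regex against a plain count over all short strings of a's and b's. The helper names both_odd and mismatches are hypothetical, and the check says nothing about where the error comes from.

```python
import re
from itertools import product

# Regex exactly as printed by the library call above.
GENERATED = re.compile(r"a*(ab|b(ba*b)*(a|ba+b))((a|ba+b)(ba*b)*(a|ba+b)|ba*b)*")

def both_odd(s):
    """Reference check by counting: odd number of a's and odd number of b's."""
    return s.count("a") % 2 == 1 and s.count("b") % 2 == 1

def mismatches(max_len):
    """All strings over {a, b} up to max_len where the regex and the count disagree."""
    found = []
    for n in range(max_len + 1):
        for chars in product("ab", repeat=n):
            s = "".join(chars)
            if (GENERATED.fullmatch(s) is not None) != both_odd(s):
                found.append(s)
    return found

# "aab" has two a's, yet the regex accepts it: a* takes one a, then "ab" follows.
print(GENERATED.fullmatch("aab") is not None, both_odd("aab"))  # True False
print(mismatches(3))
```

So the printed regex does accept at least one string it should reject, which fits the doubt about the leading a*.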

answered Oct 29 '22 15:10

Thomas

Here's one way to do it, using lookaheads to assert each condition in turn.

^(?=[^a]*a(?:[^a]*a[^a]*a)*[^a]*$)(?=[^b]*b(?:[^b]*b[^b]*b)*[^b]*$)(.*)$

Here's a demo with your examples. (The \ns in the demo are for presentation purposes. Also, you can drop the (.*)$ if you only need to test the match, not capture.)
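
In case the demo link is unavailable, here is a small Python sketch that runs the same pattern over the test strings from the question. The names PATTERN and CASES are mine.

```python
import re

# The lookahead pattern from above, split across two lines for readability.
PATTERN = re.compile(
    r"^(?=[^a]*a(?:[^a]*a[^a]*a)*[^a]*$)"
    r"(?=[^b]*b(?:[^b]*b[^b]*b)*[^b]*$)(.*)$"
)

# Test strings and expected results, copied from the question.
CASES = {
    "ba": True, "abbb": True, "bbba": True, "bbab": True,
    "ababab": True, "bbaaba": True,
    "abb": False, "aabb": False, "aabba": False,
    "": False, "a": False, "b": False,
}

for text, expected in CASES.items():
    got = PATTERN.match(text) is not None
    print(f"{text!r:10} expected={expected} got={got}")
```

Note that in Python `$` also matches just before a trailing newline, so strip the input first if that matters.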

I will be adding an explanation shortly.

Explanation

We only need to look at one half:

(?=  [^a]*a  (?:[^a]*a[^a]*a)  *  [^a]*$  )
|    |       |                 |  |
|    |       |                 |  Only accept non-'a's to the end.
|    |       |                 |
|    |       |                 Zero or more of these pairs of 'a's.
|    |       |
|    |       Strictly a pair of 'a's.
|    |
|    Find the first 'a'.
|
Use a lookahead to assert multiple conditions.
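
Since each letter gets its own independent lookahead, the pattern can be generated for any set of letters, such as the {a, b, c} mentioned in the question. A minimal sketch, assuming single-character letters; the function name odd_counts_pattern is hypothetical, and the capturing `(.*)$` is left out because only a yes/no answer is needed.

```python
import re

def odd_counts_pattern(letters):
    """Build one lookahead per letter, each asserting an odd count of that letter."""
    parts = []
    for letter in letters:
        x = re.escape(letter)            # keep regex metacharacters literal
        parts.append(f"(?=[^{x}]*{x}(?:[^{x}]*{x}[^{x}]*{x})*[^{x}]*$)")
    return re.compile("^" + "".join(parts))

abc = odd_counts_pattern("abc")
print(abc.match("abc") is not None)     # True  (1 a, 1 b, 1 c)
print(abc.match("abcc") is not None)    # False (2 c's)
```

The pattern grows linearly with the number of letters, whereas a DFA for the same check needs one state per combination of parities.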

answered Oct 29 '22 13:10

slackwing

Tags: python, regex
