What is the difference between (a+b)* and (a*b*)*?

Question

Assuming that Σ = {a, b}, I want to find out the regular expression (RE) Σ* (that being the set of all possible strings over the alphabet Σ).

I came up with below two possibilities:

(a+b)*
(a*b*)*

However, I can't decide by myself which RE is correct, or if both are bad. So, please tell me the correct answer.

Welbog · Accepted Answer

The + operator is typically used to indicate union (|, "or") in academic regular expressions, not "one or more" as it typically means in non-academic settings (such as most regex implementations).

So, a+b means [ab] or a|b, thus (a+b)* means any string of length 0 or more, containing any number of as and bs in any order.

Likewise, (a*b*)* also means any string of length 0 or more, containing any number of as and bs in any order.

The two expressions are different ways of expressing the same language.

paxdiablo · Answer

In normal regular expression grammar, (a+b)* means zero or more of any sequence that start with a, then have zero or more a, then a b. This discounts things like baa (it doesn't start with a), abba, and a (there must be one exactly b after each a group), so is not correct.

(a*b*)* means zero or more of any sequence that contain zero or more a followed by zero or more b. This is more correct since it allows for either starting character, any order and quantity of characters, and so on. It also allows the empty string which I'm pretty certain should be allowed by Σ* (but I'll leave that up to you).

However, it may be better to opt for the much simpler [ab]* (or [ab]+ in the unlikely event you consider an empty string invalid). This is basically zero (one for the + variant) or more of any character drawn from the class [ab].

However, it's possible, since you're using Σ, that you may be discussing formal language theory (where Σ is common) rather than regex grammar (where it tends not to be).

If that is the case then you should understand that there are variants of the formal language where the a | b expression (effectively [ab] in regex grammar) can instead be rendered as one of a ∪ b, a ∨ b or a + b, with each of those operator symbols representing "logical or".

That would mean that (a+b)* is actually correct (as it is equivalent to the regex grammar I gave above) for what you need since it basically means any character from the set {a, b}, repeated zero or more times.

Additionally, that's also covered by your (a*b*)* option but it's almost always better to choose the simplest one that does the job :-)

And just something else to keep in mind for the formal language case. In English (for example), "a" is a word but you'd struggle to find anyone supporting the possibility that "" is also a word. Try looking it up in a dictionary :-)

In other words, any regular expression that allows an empty sequence of the language characters (such as (a+b)*) may not be suitable. You may find that (a+b)(a+b)* is a better option. This depends on whether Σ* allows for the empty sequence.

What is the difference between (a+b)* and (ab)*?

Tags:

regex

computer-science

automata

Rurou2

2 Answers

Welbog

paxdiablo

Recent Activity

Donate For Us