Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the difference between (a+b)* and (a*b*)*?

Assuming that Σ = {a, b}, I want to find out the regular expression (RE) Σ* (that being the set of all possible strings over the alphabet Σ).

I came up with below two possibilities:

(a+b)*
(a*b*)*

However, I can't decide by myself which RE is correct, or if both are bad. So, please tell me the correct answer.

like image 996
Rurou2 Avatar asked Jan 01 '23 07:01

Rurou2


2 Answers

The + operator is typically used to indicate union (|, "or") in academic regular expressions, not "one or more" as it typically means in non-academic settings (such as most regex implementations).

So, a+b means [ab] or a|b, thus (a+b)* means any string of length 0 or more, containing any number of as and bs in any order.

Likewise, (a*b*)* also means any string of length 0 or more, containing any number of as and bs in any order.

The two expressions are different ways of expressing the same language.

like image 85
Welbog Avatar answered Jan 10 '23 04:01

Welbog


In normal regular expression grammar, (a+b)* means zero or more of any sequence that start with a, then have zero or more a, then a b. This discounts things like baa (it doesn't start with a), abba, and a (there must be one exactly b after each a group), so is not correct.

(a*b*)* means zero or more of any sequence that contain zero or more a followed by zero or more b. This is more correct since it allows for either starting character, any order and quantity of characters, and so on. It also allows the empty string which I'm pretty certain should be allowed by Σ* (but I'll leave that up to you).

However, it may be better to opt for the much simpler [ab]* (or [ab]+ in the unlikely event you consider an empty string invalid). This is basically zero (one for the + variant) or more of any character drawn from the class [ab].


However, it's possible, since you're using Σ, that you may be discussing formal language theory (where Σ is common) rather than regex grammar (where it tends not to be).

If that is the case then you should understand that there are variants of the formal language where the a | b expression (effectively [ab] in regex grammar) can instead be rendered as one of a ∪ b, a ∨ b or a + b, with each of those operator symbols representing "logical or".

That would mean that (a+b)* is actually correct (as it is equivalent to the regex grammar I gave above) for what you need since it basically means any character from the set {a, b}, repeated zero or more times.

Additionally, that's also covered by your (a*b*)* option but it's almost always better to choose the simplest one that does the job :-)


And just something else to keep in mind for the formal language case. In English (for example), "a" is a word but you'd struggle to find anyone supporting the possibility that "" is also a word. Try looking it up in a dictionary :-)

In other words, any regular expression that allows an empty sequence of the language characters (such as (a+b)*) may not be suitable. You may find that (a+b)(a+b)* is a better option. This depends on whether Σ* allows for the empty sequence.

like image 45
paxdiablo Avatar answered Jan 10 '23 04:01

paxdiablo