Assuming that Σ = {a, b}
, I want to find out the regular expression (RE) Σ*
(that being the set of all possible strings over the alphabet Σ
).
I came up with below two possibilities:
(a+b)*
(a*b*)*
However, I can't decide by myself which RE is correct, or if both are bad. So, please tell me the correct answer.
The +
operator is typically used to indicate union (|
, "or") in academic regular expressions, not "one or more" as it typically means in non-academic settings (such as most regex implementations).
So, a+b
means [ab]
or a|b
, thus (a+b)*
means any string of length 0 or more, containing any number of a
s and b
s in any order.
Likewise, (a*b*)*
also means any string of length 0 or more, containing any number of a
s and b
s in any order.
The two expressions are different ways of expressing the same language.
In normal regular expression grammar, (a+b)*
means zero or more of any sequence that start with a
, then have zero or more a
, then a b
. This discounts things like baa
(it doesn't start with a
), abba
, and a
(there must be one exactly b
after each a
group), so is not correct.
(a*b*)*
means zero or more of any sequence that contain zero or more a
followed by zero or more b
. This is more correct since it allows for either starting character, any order and quantity of characters, and so on. It also allows the empty string which I'm pretty certain should be allowed by Σ*
(but I'll leave that up to you).
However, it may be better to opt for the much simpler [ab]*
(or [ab]+
in the unlikely event you consider an empty string invalid). This is basically zero (one for the +
variant) or more of any character drawn from the class [ab]
.
However, it's possible, since you're using Σ
, that you may be discussing formal language theory (where Σ
is common) rather than regex grammar (where it tends not to be).
If that is the case then you should understand that there are variants of the formal language where the a | b
expression (effectively [ab]
in regex grammar) can instead be rendered as one of a ∪ b
, a ∨ b
or a + b
, with each of those operator symbols representing "logical or".
That would mean that (a+b)*
is actually correct (as it is equivalent to the regex grammar I gave above) for what you need since it basically means any character from the set {a, b}
, repeated zero or more times.
Additionally, that's also covered by your (a*b*)*
option but it's almost always better to choose the simplest one that does the job :-)
And just something else to keep in mind for the formal language case. In English (for example), "a"
is a word but you'd struggle to find anyone supporting the possibility that ""
is also a word. Try looking it up in a dictionary :-)
In other words, any regular expression that allows an empty sequence of the language characters (such as (a+b)*
) may not be suitable. You may find that (a+b)(a+b)*
is a better option. This depends on whether Σ*
allows for the empty sequence.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With