Given the following simple regular expression, whose goal is to capture the text between quote characters:
regexp = '"?(.+)"?'
When the input is something like:
"text"
The capturing group(1) has the following:
text"
I expected group(1) to contain only text (without the quotes). Could somebody explain what's going on, and why the regular expression captures the " symbol even though it is outside capturing group #1? Another strange behavior I don't understand is why the second quote character is captured but not the first, given that both of them are optional. I eventually fixed it by using the following regex, but I would like to understand what I was doing wrong:
regexp = '"?([^"]+)"?'
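Quantifiers in regular expressions are greedy by default: they try to match as much text as possible. Because your last " is optional (you wrote "? in your regular expression), the .+ will match it, and the trailing "? then matches nothing. The leading "? is greedy too and is tried first, so it consumes the opening quote before the group starts, which is why only the closing quote ends up in group(1).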
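As a minimal sketch of the behavior described above (Python's re module is an assumption here; the question does not say which engine is in use, but greedy matching works the same way in the common backtracking engines):

```python
import re

# The leading "? consumes the opening quote, then the greedy .+ runs to the
# end of the input, taking the closing quote with it. The trailing "? is
# optional, so it matches the empty string and the overall match succeeds.
greedy = re.search(r'"?(.+)"?', '"text"')

print(greedy.group(0))  # "text"
print(greedy.group(1))  # text"
```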
Using [^"] is one acceptable solution. The drawback is that your string cannot contain " characters (which may or may not be desirable, depending on the case).
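A short illustration of that trade-off, again assuming Python's re module purely for demonstration:

```python
import re

# [^"]+ can never match a quote, so the group stops at the first one.
no_quotes = re.compile(r'"?([^"]+)"?')

simple = no_quotes.search('"text"').group(1)
embedded = no_quotes.search('"t"e"s"t"').group(1)

print(simple)    # text
print(embedded)  # t   (the match ends at the first inner quote)
```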
Another is to make the " required:
regexp = '"(.+)"'
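A small sketch of what requiring the quotes changes (Python's re module is assumed for the demonstration). The greedy .+ still grabs the closing quote at first, but the engine has to backtrack one character so that the mandatory final " can match:

```python
import re

required = re.compile(r'"(.+)"')

quoted = required.search('"text"')
unquoted = required.search('text')

print(quoted.group(1))  # text
print(unquoted)         # None: input without quotes no longer matches
```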
Another one is to make the + non-greedy by using +?. However, you also need to add the anchors ^ and $ (or similar, depending on the context); otherwise it will match only the first character (t in the case of "test"):
regexp = '^"?(.+?)"?$'
This regular expression allows " characters in the middle of the string, so that "t"e"s"t" will result in t"e"s"t being captured by the group.
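The difference the anchors make can be checked with a quick sketch (Python's re module is an assumption; the variable names are illustrative only):

```python
import re

# Without anchors, the lazy .+? is satisfied after a single character.
unanchored = re.search(r'"?(.+?)"?', '"test"').group(1)

# With ^ and $, the lazy group is forced to keep expanding until the
# optional closing quote and the end of the input can both match.
anchored = re.compile(r'^"?(.+?)"?$')
plain = anchored.search('"test"').group(1)
inner_quotes = anchored.search('"t"e"s"t"').group(1)
no_quotes = anchored.search('test').group(1)

print(unanchored)    # t
print(plain)         # test
print(inner_quotes)  # t"e"s"t
print(no_quotes)     # test
```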