 

Strange behavior of capturing group in regular expression

Tags: python, regex

Given the following simple regular expression, whose goal is to capture the text between quote characters:

regexp = '"?(.+)"?'

When the input is something like:

"text"

Capturing group(1) contains the following:

text"

I expected group(1) to contain text only (without the quotes). Could somebody explain what's going on, and why the regular expression captures the " symbol even though it is outside capturing group 1? Another behavior I don't understand is why the second quote character is captured but not the first one, given that both of them are optional. I eventually fixed it by using the following regex, but I would like to understand what I was doing wrong:

regexp = '"?([^"]+)"?'
rkachach asked Feb 19 '16 15:02



1 Answer

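Quantifiers in regular expressions are greedy by default: they try to match as much text as possible. The leading "? is greedy too, so it consumes the opening quote before the group starts, which is why the first quote is not captured. The .+ then runs to the end of the input, including the last ". Because that last " is optional (you wrote "? in your regular expression), the pattern still succeeds with the trailing "? matching nothing.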
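To see the greedy behavior for yourself, here is a minimal sketch using Python's re module with the pattern and input from the question (the variable names are just for illustration):

```python
import re

# Pattern from the question: both quotes optional, greedy group.
pattern = re.compile(r'"?(.+)"?')

match = pattern.match('"text"')
# The leading "? takes the opening quote before the group starts.
# The greedy .+ then runs to the end of the input, closing quote included.
# The trailing "? is optional, so it matches nothing and the match succeeds.
captured = match.group(1)
print(captured)  # text"
```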

Using [^"] is one acceptable solution. The drawback is that your string cannot contain " characters (which may or may not be desirable, depending on the case).
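As a quick illustration of that drawback, the sketch below runs the [^"] pattern on a plain quoted string and on a made-up input with a quote in the middle (the second input is an assumption chosen only to show the cutoff):

```python
import re

# Fix from the question: the group may not contain a quote character.
pattern = re.compile(r'"?([^"]+)"?')

simple = pattern.match('"text"').group(1)
# With an embedded quote, the group stops at the first " it meets.
embedded = pattern.match('"ab"cd"').group(1)

print(repr(simple))    # 'text'
print(repr(embedded))  # 'ab'
```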

Another is to make " required:

regexp = '"(.+)"'
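A small sketch of how the required-quotes version behaves, assuming Python's re module. Note that .+ is still greedy here, so on an input with several quoted parts (a made-up example) the group spans from the first quote to the last one, and unquoted input no longer matches at all:

```python
import re

# Both quotes are now mandatory; the group is still greedy.
pattern = re.compile(r'"(.+)"')

quoted = pattern.search('"text"').group(1)
# Greedy .+ backtracks only as far as the last quote in the input.
several = pattern.search('"a" and "b"').group(1)
# Without any quotes there is nothing to match.
unquoted = pattern.search('text')

print(repr(quoted))   # 'text'
print(repr(several))  # 'a" and "b'
print(unquoted)       # None
```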

Another one is to make the + non-greedy, by using +?. However, you also need to add the anchors ^ and $ (or similar, depending on the context), otherwise the group will match only the first character (t in the case of "test"):

regexp = '^"?(.+?)"?$'

This regular expression allows " characters to be in the middle of the string, so that "t"e"s"t" will result in t"e"s"t being captured by the group.
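The difference the anchors make can be checked with a short sketch like the one below (Python's re module; the variable names are only illustrative):

```python
import re

# Non-greedy group without anchors: stops as soon as it legally can.
lazy_unanchored = re.compile(r'"?(.+?)"?')
# Non-greedy group with anchors: must stretch until "?$ can match.
lazy_anchored = re.compile(r'^"?(.+?)"?$')

first_char_only = lazy_unanchored.match('"test"').group(1)
whole_text = lazy_anchored.match('"test"').group(1)
inner_quotes = lazy_anchored.match('"t"e"s"t"').group(1)
no_quotes = lazy_anchored.match('test').group(1)

print(repr(first_char_only))  # 't'
print(repr(whole_text))       # 'test'
print(repr(inner_quotes))     # 't"e"s"t'
print(repr(no_quotes))        # 'test'
```

Because both quotes stay optional, the anchored version also accepts input that has no surrounding quotes at all.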

Andrea Corbellini answered Oct 14 '22 00:10
