
How can I match a quote-delimited string with a regex?

Tags: regex, perl
If I'm trying to match a quote-delimited string with a regex, which of the following is "better" (where "better" means both more efficient and less likely to do something unexpected):

/"[^"]+"/ # match quote, then everything that's not a quote, then a quote 

or

/".+?"/   # match quote, then *anything* (non-greedy), then a quote 

Assume for this question that empty strings (i.e. "") are not an issue. It seems to me (no regex newbie, but certainly no expert) that these will be equivalent.

Update: Upon reflection, I think changing the + characters to * will handle empty strings correctly anyway.

Graeme Perrow asked Dec 17 '08 16:12





1 Answer

You should use number one, because number two is bad practice. Consider that the developer who comes after you wants to match strings that are followed by an exclamation point. Should he use:

"[^"]*"! 

or:

".*?"! 

The difference appears when you have the subject:

"one" "two"! 

The first regex matches:

"two"! 

while the second regex matches:

"one" "two"! 

Always be as specific as you can. Use the negated character class when you can.
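To make the difference concrete, here is a minimal sketch that runs both patterns against the subject above. It uses Python's re module purely for illustration (my choice, not part of the question); for these two patterns it behaves the same way as Perl.

```python
import re

subject = '"one" "two"!'

# Negated character class: the body can never contain a quote,
# so a match cannot run past the closing quote of "one".
negated = re.search(r'"[^"]*"!', subject)

# Lazy dot: the dot is allowed to match a quote, so the engine keeps
# expanding from the first quote until '"!' finally fits.
lazy = re.search(r'".*?"!', subject)

print(negated.group())  # "two"!
print(lazy.group())     # "one" "two"!
```

Lazy only means "as few characters as possible from wherever the match started". It does not stop the dot from walking over a quote when that is the only way to make the whole regex succeed.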

Another difference is that [^"]* can span across lines, while .* can't unless you use single line mode. If you don't want the match to span lines either, use [^"\n]*, which excludes the line breaks too.
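The line-break behavior can be checked with a quoted string that contains a newline. Again a sketch in Python rather than Perl; re.DOTALL is Python's counterpart of single line mode (/s in Perl), and the sample text is made up for the example.

```python
import re

# Hypothetical sample: one quoted string that spans two lines.
text = '"first line\nsecond line"'

# The negated class happily matches the newline.
neg = re.search(r'"[^"]*"', text)

# The dot stops at the newline, so no match is found...
dot = re.search(r'".*?"', text)

# ...unless single line mode lets the dot match newlines too.
dot_s = re.search(r'".*?"', text, re.DOTALL)

# Adding \n to the negated class keeps the match on one line.
neg_nl = re.search(r'"[^"\n]*"', text)

print(neg is not None)     # True
print(dot is not None)     # False
print(dot_s is not None)   # True
print(neg_nl is not None)  # False
```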

As for backtracking, the second regex backtracks for each and every character in every string that it matches. If the closing quote is missing, both regexes will backtrack through the entire file. Only the order in which they backtrack is different. Thus, in theory, the first regex is faster. In practice, you won't notice the difference.
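If you want to see how small the difference is on your own data, a rough timing sketch like the following will do. The test input (10,000 short quoted words) and the repeat count are arbitrary assumptions, and the numbers depend on the regex engine and the machine, so no expected timings are given.

```python
import re
import timeit

# Made-up input: many short, well-formed quoted strings on one line.
text = ' '.join('"word%d"' % i for i in range(10_000))

negated = re.compile(r'"[^"]*"')
lazy = re.compile(r'".*?"')

# On well-formed, single-line input both patterns find the same strings.
assert negated.findall(text) == lazy.findall(text)

t_neg = timeit.timeit(lambda: negated.findall(text), number=20)
t_lazy = timeit.timeit(lambda: lazy.findall(text), number=20)

print(f"negated class: {t_neg:.4f}s")
print(f"lazy dot:      {t_lazy:.4f}s")
```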

Jan Goyvaerts answered Sep 29 '22 21:09
