
Regular expression for C# verbatim-like strings (processing ""-like escapes)

I'm trying to extract information from rc-files. In these files, " characters inside strings are escaped by doubling them (""), analogous to C# verbatim strings. Is there a way to extract the string?

For example, given the string "this is a ""test""", I would like to obtain this is a ""test"". The match must also be non-greedy (very important).

I've tried to use the following regular expression:

"(?<text>[^""]*(""(.|""|[^"])*)*)"

However, the performance was awful. I based it on the explanation here: http://ad.hominem.org/log/2005/05/quoted_strings.php

Does anybody have an idea how to handle this with a regular expression?

MartenBE asked Nov 21 '12 14:11


2 Answers

You've got nested repetition quantifiers there (a * inside a *, with alternatives that can match the same characters). That can cause catastrophic backtracking and ruin performance.

Try something like this:

(?<=")(?:[^"]|"")*(?=")

This can only consume either two quotes at once or a single non-quote character, so at each position there is only one way to proceed. The lookbehind and lookahead assert that the actual match is preceded and followed by a quote.

This also saves you from having to capture anything: the full match is exactly the string you want (without the outer quotes).

I do not assert that the outer quotes are not doubled, because if they were, there would be no way to distinguish them from an empty string anyway.
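As a quick check of the lookaround pattern, here is a minimal sketch in Python. Python is my choice for illustration only, since the question does not name a host language, and the helper name first_inner is hypothetical. Python's re module accepts this pattern unchanged because the lookbehind is fixed-width.

```python
import re

# The lookaround pattern from the answer above.
# A raw single-quoted string means the double quotes need no escaping.
INNER = re.compile(r'(?<=")(?:[^"]|"")*(?=")')

def first_inner(text):
    """Return the contents of the first quoted string (outer quotes excluded), or None."""
    m = INNER.search(text)
    return m.group(0) if m else None

print(first_inner('"this is a ""test"""'))  # this is a ""test""
```

One caveat I would watch for: since the outer quotes are not consumed, scanning a line that holds several strings can also return the text between them. For instance, INNER.findall('"a" "b"') gives ['a', ' ', 'b'], so this sketch is safest when each input contains a single quoted string.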

Martin Ender answered Nov 09 '22 13:11


This turns out to be a lot simpler than you'd expect. A string literal with escaped quotes looks exactly like a bunch of simple string literals run together:

"Some ""escaped"" quotes"

"Some " + "escaped" + " quotes"

So this is all you need to match it:

(?:"[^"]*")+

You'll have to strip off the leading and trailing quotes in a separate step, but that's not a big deal. You would need a separate step anyway, to unescape the escaped quotes (\" or "").
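The match, strip, and unescape steps could be sketched as follows. This is a minimal Python illustration under my own assumptions: the question does not name a host language, the function name extract_strings is hypothetical, and only the "" escape from the question is handled.

```python
import re

# One or more simple "..." literals run together, as described above.
QUOTED = re.compile(r'(?:"[^"]*")+')

def extract_strings(text):
    """Return the unescaped contents of every quoted string found in text."""
    results = []
    for m in QUOTED.finditer(text):
        inner = m.group(0)[1:-1]                  # strip the outer quotes
        results.append(inner.replace('""', '"'))  # unescape doubled quotes
    return results

print(extract_strings('"this is a ""test"""'))  # ['this is a "test"']
```

Note that two separate strings written with nothing between them, such as "a""b", are indistinguishable from one string containing an escaped quote, so they come back as a single result.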

Alan Moore answered Nov 09 '22 14:11