
Regex for matching a character, but not when it's enclosed in quotes

Tags:

regex

I need to match a colon (':') in a string, but not when it's enclosed by quotes - either a " or ' character.

So the following should have 2 matches

something:'firstValue':'secondValue'
something:"firstValue":'secondValue'

but this should only have 1 match

something:'no:match'
Jaco Pretorius asked Sep 18 '09 09:09



2 Answers

If the regular expression implementation supports look-around assertions, try this:

:(?:(?<=["']:)|(?=["']))

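This will match any colon that is either preceded or followed by a single or double quote. It therefore only handles constructs like the ones you mentioned: something:firstValue would not be matched, and a colon that sits directly next to a quote inside a quoted value (as in 'a:') would be matched by mistake.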
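To see how the pattern behaves on the strings from the question, here is a small sketch. Python is my assumption (the question doesn't name a language), and count_colons is just an illustrative helper name:

```python
import re

# Pattern copied from the answer above: a colon preceded by a quote
# or followed by a quote.
QUOTE_ADJACENT_COLON = re.compile(r""":(?:(?<=["']:)|(?=["']))""")


def count_colons(text):
    """Count colons that sit directly next to a quote character."""
    return len(QUOTE_ADJACENT_COLON.findall(text))


samples = [
    "something:'firstValue':'secondValue'",
    "something:\"firstValue\":'secondValue'",
    "something:'no:match'",
    "something:firstValue",  # unquoted value: the colon is missed
]

for sample in samples:
    print(sample, "->", count_colons(sample))
```

The last sample shows the limitation mentioned above: the pattern only looks at the neighbouring characters, not at whether a quotation is actually open.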

It would be better to build a small parser that reads the input byte-by-byte and keeps track of whether a quotation is currently open.
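A minimal sketch of such a parser, again in Python as an assumption. The function name unquoted_colon_positions is made up for illustration, and the sketch assumes quotes are never escaped and that a section opened with one quote type is closed by the same type:

```python
def unquoted_colon_positions(text):
    """Return the indexes of colons that lie outside any quoted section."""
    positions = []
    open_quote = None  # the quote character currently open, or None
    for index, char in enumerate(text):
        if open_quote is not None:
            # Inside quotes: only the matching quote character closes them.
            if char == open_quote:
                open_quote = None
        elif char in "\"'":
            open_quote = char
        elif char == ":":
            positions.append(index)
    return positions


print(unquoted_colon_positions("something:'firstValue':'secondValue'"))
print(unquoted_colon_positions("something:'no:match'"))
```

Unlike the look-around pattern, this also handles unquoted values such as something:firstValue, because it tracks the quoting state instead of looking at adjacent characters.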

Gumbo answered Oct 23 '22 11:10



Regular expressions are stateless. Tracking whether you are inside of quotes or not is state information. It is, therefore, impossible to handle this correctly using only a single regular expression. (Note that some "regular expression" implementations add extensions which may make this possible; I'm talking solely about "true" regular expressions here.)

Doing it with two regular expressions is possible, though, provided that you're willing to modify the original string or to work with a copy of it. In Perl:

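# Remove every quoted section, along with any colons inside it.
$string =~ s/['"][^'"]*['"]//g;
# Assigning to an empty list first makes the match return a count.
my $match_count = () = $string =~ /:/g;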

The first will find every sequence consisting of a quote, followed by any number of non-quote characters, and terminated by a second quote, and remove all such sequences from the string. This will eliminate any colons which are within quotes. (something:"firstValue":'secondValue' becomes something:: and something:'no:match' becomes something:)

The second does a simple count of the remaining colons, which will be those that weren't within quotes to start with.

Just counting the unquoted colons doesn't seem particularly useful in most cases, though, so I suspect that your real goal is to split the string into fields with the colon as the field delimiter. In that case this regex-based solution is unsuitable, because it destroys any data in quoted fields. You would need a real parser instead (most CSV parsers allow you to specify the delimiter and would be ideal for this) or, in the worst case, to walk through the string character-by-character and split it manually.
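As one example of the CSV-parser route, here is a sketch using Python's standard csv module (the language is my assumption; split_fields is a hypothetical helper). Note that csv accepts only one quotechar per reader, so input that mixes " and ' as quote characters would still need a hand-written parser:

```python
import csv


def split_fields(line, quotechar="'"):
    """Split one line on colons, keeping colons inside quoted fields."""
    reader = csv.reader([line], delimiter=":", quotechar=quotechar)
    return next(reader)


print(split_fields("something:'firstValue':'secondValue'"))
print(split_fields("something:'no:match'"))
```

The quoted data survives here, which is the point: the fields come back intact instead of being stripped out of the string.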

If you tell us the language you're using, I'm sure somebody could suggest a good parser library for that language.

Dave Sherohman answered Oct 23 '22 12:10
