
Regex to match all instances not inside quotes

From this q/a, I deduced that matching all instances of a given regex that are not inside quotes is impossible. That is, such a regex can't handle escaped quotes (e.g. "this whole \"match\" should be taken"). If there is a way to do it that I don't know about, that would solve my problem.

If not, however, I'd like to know if there is any efficient alternative that could be used in JavaScript. I've thought about it a bit, but can't come up with any elegant solution that would work in most, if not all, cases.

Specifically, I just need the alternative to work with the .split() and .replace() methods, but a more general solution would be best.

For example, an input string of:
+bar+baz"not+or\"+or+\"this+"foo+bar+
replacing + with #, not inside quotes, would return:
#bar#baz"not+or\"+or+\"this+"foo#bar#

asked Jun 24 '11 02:06 by Azmisov


2 Answers

Actually, you can match all instances of a regex not inside quotes for any string in which each opening quote is closed again. Say, as in your example above, you want to match \+.

The key observation here is that a character is outside quotes if an even number of quotes follows it. This can be modeled as a look-ahead assertion:

\+(?=([^"]*"[^"]*")*[^"]*$) 
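As a rough sketch of how that look-ahead could be used from JavaScript: the variable names and sample strings below are my own, and the groups are made non-capturing (a change from the pattern as written above) so that .split() does not splice captured text into its result.

```javascript
// A "+" matches only when an even number of quotes follows it.
// Assumes every opening quote is closed and there are no escaped quotes.
const plusOutsideQuotes = /\+(?=(?:[^"]*"[^"]*")*[^"]*$)/g;

const replacedSimple = '+bar+baz"not+these+"foo+bar+'.replace(plusOutsideQuotes, "#");
console.log(replacedSimple); // #bar#baz"not+these+"foo#bar#

const splitSimple = 'a+b"c+d"+e'.split(plusOutsideQuotes);
console.log(splitSimple); // [ 'a', 'b"c+d"', 'e' ]
```

Keep in mind that the look-ahead rescans the rest of the string for every candidate +, so this may get slow on long inputs.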

Now, you'd like to not count escaped quotes, which makes things a little more complicated. Instead of [^"]*, which advances to the next quote, you need to consider backslashes as well and use [^"\\]*. Once you arrive at either a backslash or a quote, you either skip the next character (if you hit a backslash) or advance to the next unescaped quote (if you hit a quote). That looks like (\\.|"([^"\\]*\\.)*[^"\\]*"). Combined, you arrive at

\+(?=([^"\\]*(\\.|"([^"\\]*\\.)*[^"\\]*"))*[^"]*$) 
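Here is a small sketch that applies this escaped-quote-aware pattern to the input from the question. The variable names are mine, and I have again swapped the groups for non-capturing ones; otherwise the pattern is the one above.

```javascript
// Look-ahead that skips over backslash-escaped characters and
// over complete quoted runs before requiring a quote-free tail.
const plusOutsideEscaped =
  /\+(?=(?:[^"\\]*(?:\\.|"(?:[^"\\]*\\.)*[^"\\]*"))*[^"]*$)/g;

// Backslashes are doubled here only because this is a JS string literal;
// the actual text is +bar+baz"not+or\"+or+\"this+"foo+bar+
const questionInput = '+bar+baz"not+or\\"+or+\\"this+"foo+bar+';

const replacedEscaped = questionInput.replace(plusOutsideEscaped, "#");
console.log(replacedEscaped); // #bar#baz"not+or\"+or+\"this+"foo#bar#
```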

I admit it is a little cryptic. =)

answered Oct 01 '22 10:10 by Jens


Azmisov, resurrecting this question because you said you were looking for any efficient alternative that could be used in JavaScript and any elegant solutions that would work in most, if not all, cases.

There happens to be a simple, general solution that wasn't mentioned.

Compared with alternatives, the regex for this solution is amazingly simple:

"[^"]+"|(\+) 

The idea is that we match but ignore anything within quotes to neutralize that content (on the left side of the alternation). On the right side, we capture all the + that were not neutralized into Group 1, and the replace function examines Group 1. Here is full working code:

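<script>
var subject = '+bar+baz"not+these+"foo+bar+';
var regex = /"[^"]+"|(\+)/g;
// Quoted runs match the left branch and are returned unchanged;
// a + outside quotes lands in Group 1 and is replaced with #.
var replaced = subject.replace(regex, function(m, group1) {
    if (!group1) return m;
    else return "#";
});
document.write(replaced);
</script>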


You can use the same principle to match or split. See the question and article in the reference, which will also point you to code samples.
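For splitting, the alternation cannot be handed to .split() directly, because the quoted runs would be treated as delimiters too. One possible approach, sketched here with a hypothetical helper named splitOutsideQuotes (not part of any library), is to walk the matches with exec() and cut only where Group 1 is set:

```javascript
// Hypothetical helper: split on "+" but only where it is outside quotes.
function splitOutsideQuotes(text) {
  const regex = /"[^"]+"|(\+)/g;
  const parts = [];
  let last = 0; // start index of the piece currently being collected
  let m;
  while ((m = regex.exec(text)) !== null) {
    if (m[1] === undefined) continue; // quoted run: neutralized, keep going
    parts.push(text.slice(last, m.index));
    last = m.index + m[0].length;
  }
  parts.push(text.slice(last)); // trailing piece after the final delimiter
  return parts;
}

console.log(splitOutsideQuotes('a+b"c+d"+e')); // [ 'a', 'b"c+d"', 'e' ]
```

Like .split(), this keeps empty pieces when the string begins or ends with a delimiter.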

Hope this gives you a different idea of a very general way to do this. :)

What about Empty Strings?

The above is a general answer to showcase the technique. It can be tweaked depending on your exact needs. If you worry that your text might contain empty strings, just change the quantifier inside the string-capture expression from + to *:

"[^"]*"|(\+) 
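To see why the quantifier matters, here is a quick comparison on a sample string of my own that contains an empty string. With +, the empty "" is not consumed as a unit, so the quote pairing drifts and the wrong + signs get replaced; with *, the pairing stays intact.

```javascript
// Same callback for both patterns: keep neutralized matches, replace Group 1.
const swap = (m, group1) => (group1 === undefined ? m : "#");

const sample = 'a+""+b"c+d"';

const withPlus = sample.replace(/"[^"]+"|(\+)/g, swap);
const withStar = sample.replace(/"[^"]*"|(\+)/g, swap);

console.log(withPlus); // a#""+b"c#d"   (pairing drifted: wrong result)
console.log(withStar); // a#""#b"c+d"   (empty string handled)
```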


What about Escaped Quotes?

Again, the above is a general answer to showcase the technique. Not only can the "ignore this match" regex be refined to your needs, you can also add multiple expressions to ignore. For instance, if you want to make sure escaped quotes are adequately ignored, you can start by adding an alternation \\"| in front of the other two in order to match (and ignore) straggling escaped double quotes.

Next, within the section "[^"]*" that captures the content of double-quoted strings, you can add an alternation to ensure escaped double quotes are matched before their " has a chance to turn into a closing sentinel, turning it into "(?:\\"|[^"])*".

The resulting expression has three branches:

  1. \\" to match and ignore
  2. "(?:\\"|[^"])*" to match and ignore
  3. (\+) to match, capture and handle

Note that in other regex flavors, we could do this job more easily with lookbehind. JavaScript did not support lookbehind when this answer was written; it was added to the language in ES2018.

The full regex becomes:

\\"|"(?:\\"|[^"])*"|(\+) 
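As a sketch, here is the three-branch regex run against the input from the question (the variable names are mine). The callback returns the match untouched for the two "ignore" branches and substitutes # only when Group 1 is set.

```javascript
const threeBranch = /\\"|"(?:\\"|[^"])*"|(\+)/g;

// Backslashes are doubled only because this is a JS string literal;
// the actual text is +bar+baz"not+or\"+or+\"this+"foo+bar+
const original = '+bar+baz"not+or\\"+or+\\"this+"foo+bar+';

const result = original.replace(threeBranch, function (m, group1) {
  // Group 1 is undefined when one of the "ignore" branches matched.
  return group1 === undefined ? m : "#";
});

console.log(result); // #bar#baz"not+or\"+or+\"this+"foo#bar#
```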


Reference

  1. How to match pattern except in situations s1, s2, s3
  2. How to match a pattern unless...

answered Oct 01 '22 09:10 by zx81