This is one of the toughest things I have ever tried to do. Over the years I have searched but I just can’t find a way to do this — match a string not surrounded by a given char, like quotes or greater/less than symbols.
A regex like this could match URLs not in HTML links, SQL table.column values not in quotes, and lots of other things.
Example with quotes:
Match [THIS] and "something with [NOT THIS] followed by" or even [THIS].
Example with <, >, and ":
Match [URL] and <a href="[NOT URL]">or [NOT URL]</a>
Example with single quotes:
WHERE [THIS] LIKE '%[NOT THIS]'
Basically, how do you match a string (THIS) when it is not surrounded by a given char?
\b(?:[^"'])([^"']+)(?:[^"'])\b
Here is a test pattern: a regex like what I am thinking of would match only the first "quote".
To quote, "quote me not lest I quote you!"
[^ ] matches anything but a space character.
^ means "Match the start of the string" and $ means "Match the end of the string" (the position after the last character in the string). Both are called anchors and ensure that the entire string is matched instead of just a substring.
The dot matches every character except line terminators such as \r and \n. To match any character at all, use [\s\S].
In JavaScript, the \s metacharacter matches a single whitespace character: a space, tab, newline, carriage return, vertical tab, or form feed, plus other Unicode space characters. For ASCII input it is the same as [ \t\n\r\v\f].
The best solution will depend on what you know about the input. For example, if you're looking for things that aren't enclosed in double-quotes, does that mean double-quotes will always be properly balanced? Can they be escaped with backslashes, or by enclosing them in single-quotes?
Assuming the simplest case--no nesting, no escaping--you could use a lookahead like this:
// $subject stands for whatever string you are searching
preg_match('/THIS(?=(?:(?:[^"]*+"){2})*+[^"]*+\z)/', $subject);
After finding the target (THIS), the lookahead basically counts the double-quotes after that point until the end of the string. If there's an odd number of them, the match must have occurred inside a pair of double-quotes, so it's not valid (the lookahead fails).
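For anyone who wants to try the quote-counting idea outside PHP, here is a minimal sketch in Python. It only covers the simple case (no nesting, no escaping, balanced quotes). The names OUTSIDE_QUOTES and find_outside_quotes are made up for illustration, and I've swapped the possessive quantifiers for plain greedy ones, on the assumption that you may be running a Python older than 3.11, where possessive quantifiers are not available.

```python
import re

# Simple case only: no nesting, no escaped quotes, quotes always balanced.
# The lookahead pairs up the double-quotes between the match and the end of
# the string; it succeeds only when an even number of them remain.
# Plain greedy quantifiers replace the possessive ones (*+) from the PHP
# version so the pattern also compiles on Python versions before 3.11.
OUTSIDE_QUOTES = re.compile(r'THIS(?=(?:[^"]*"[^"]*")*[^"]*\Z)')


def find_outside_quotes(text):
    """Return the start offsets of every THIS that is not inside double-quotes."""
    return [m.start() for m in OUTSIDE_QUOTES.finditer(text)]


sample = 'Match THIS and "something with NOT THIS followed by" or even THIS.'
print(find_outside_quotes(sample))
```

On the sample sentence this should report the first and last THIS and skip the one inside the quotes.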
As you've discovered, this problem is not well suited to regular expressions; that's why all of the proposed solutions depend on features that aren't found in real regular expressions, like capturing groups, lookarounds, reluctant and possessive quantifiers. I wouldn't even try this without possessive quantifiers or atomic groups.
EDIT: To expand this solution to account for double-quotes that can be escaped with backslashes, you just need to replace the parts of the regex that match "anything that's not a double-quote":
[^"]
with "anything that's not a quote or a backslash, or a backslash followed by anything":
(?:[^"\\]|\\.)
Since backslash-escape sequences are relatively rare, it's worthwhile to match as many unescaped characters as you can while you're in that part of the regex:
(?:[^"\\]++|\\.)
Putting it all together, the regex becomes:
'/THIS\d+(?=(?:(?:(?:[^"\\]++|\\.)*+"){2})*+(?:[^"\\]++|\\.)*+$)/'
Applied to your test string:
'Match THIS1 and "NOT THIS2" but THIS3 and "NOT "THIS4" or NOT THIS5" ' +
'but \"THIS6\" is good and \\\\"NOT THIS7\\\\".'
...it should match 'THIS1', 'THIS3', 'THIS4' and 'THIS6'.
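If you want to check that result yourself, here is a rough Python sketch of the escaped-quote version. A few assumptions: the test string is written as a raw string so the backslashes reach the regex as they would from the PHP single-quoted literal, UNIT and PATTERN are names I made up, and the possessive quantifiers are dropped (they need Python 3.11+), so each "unit" matches one character or one escape pair at a time.

```python
import re

# One "unit" = an ordinary character (not a quote or backslash),
# or a backslash followed by any character.
# Assumption: no possessive quantifiers, so this also runs on Python < 3.11.
UNIT = r'(?:[^"\\]|\\.)'

# THIS plus digits, followed by an even number of unescaped double-quotes
# before the end of the string.
PATTERN = re.compile(r'THIS\d+(?=(?:(?:' + UNIT + r'*"){2})*' + UNIT + r'*\Z)')

# Raw strings keep every backslash literally: \" is an escaped quote,
# \\" is an escaped backslash followed by a real quote.
subject = (r'Match THIS1 and "NOT THIS2" but THIS3 and "NOT "THIS4" or NOT THIS5" '
           r'but \"THIS6\" is good and \\"NOT THIS7\\".')

print(PATTERN.findall(subject))  # ['THIS1', 'THIS3', 'THIS4', 'THIS6']
```

Note that \Z is used instead of $ because in Python $ also matches just before a trailing newline.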
It is a bit tough. There are ways, as long as you don't need to keep track of nesting. For instance, let's avoid quoted stuff:
^((?:[^"\\]|\\.|"(?:[^"\\]|\\.)*")*?)THIS
Or, explaining:
^             Match from the beginning
(             Store everything from the beginning in group 1, in case I want to do a replace
  (?:         Non-capturing group, just so I can repeat it
    [^"\\]    Anything but a quote or escape character
    |         or...
    \\.       Any escaped character (i.e., \", for example)
    |         or...
    "         A quote, followed by...
    (?:       ...another non-capturing group, of...
      [^"\\]  Anything but a quote or escape character
      |       or...
      \\.     Any escaped character
    )*        ...as many times as possible, followed by...
    "         A (closing) quote
  )*?         As many as necessary, but as few as possible
)             And this is the end of group 1
THIS          Followed by THIS
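To show what group 1 is good for, here is a small Python sketch that replaces the first THIS found outside double-quotes. The function name replace_first_unquoted and the sample string are my own, purely for illustration. Because the pattern is anchored with ^, it only ever finds the first unquoted occurrence.

```python
import re

# Same pattern as above: group 1 lazily swallows ordinary characters,
# escaped characters, and whole quoted strings until THIS is reached.
PREFIX_THEN_THIS = re.compile(r'^((?:[^"\\]|\\.|"(?:[^"\\]|\\.)*")*?)THIS')


def replace_first_unquoted(text, replacement):
    """Replace the first THIS that is not inside double-quotes."""
    # Put the skipped prefix (group 1) back in front of the replacement.
    return PREFIX_THEN_THIS.sub(lambda m: m.group(1) + replacement, text, count=1)


print(replace_first_unquoted('say "THIS" then THIS', 'THAT'))
# say "THIS" then THAT
```

A quote character can only be consumed as the start of a complete quoted string, which is why the THIS inside the quotes is skipped rather than matched.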
Now, there are other ways of doing this, but, perhaps, not as flexible. For instance, if you want to find THIS, as long as there wasn't a preceding "//" or "#" sequence -- in other words, a THIS outside a comment, you could do it like this:
(?<!(?:#|//).*)THIS
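Here, (?<!...) is a negative look-behind. It won't consume these characters, but it will test that they do not appear before THIS. Note that this look-behind contains .*, which has variable length; many regex engines (PCRE and Python's re among them) only accept fixed-length look-behinds, while .NET accepts this pattern.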
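Python's re module rejects a look-behind with .* in it, so the sketch below swaps in a different technique to get the same effect: cut the line at the first comment marker and search only what comes before it. This assumes single-line input and that # or // never appears inside a string literal. The names COMMENT_START and find_this_outside_comment are hypothetical.

```python
import re

# Swapped-in technique (not the look-behind itself): find where the comment
# starts, then search only the part of the line before it.
COMMENT_START = re.compile(r'#|//')


def find_this_outside_comment(line):
    """Return start offsets of THIS that appear before any # or // on the line."""
    marker = COMMENT_START.search(line)
    code_part = line if marker is None else line[:marker.start()]
    return [m.start() for m in re.finditer(r'THIS', code_part)]


print(find_this_outside_comment('x = THIS  # not THIS'))  # [4]
```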
As for arbitrarily nested structures -- n ( closed by n ), for example -- they can't be represented by regular expressions. Perl can match them with recursive patterns, but that goes beyond what a regular expression is in the formal sense.