Regex to match values not surrounded by another char?

Tags:

regex

This is one of the toughest things I have ever tried to do. Over the years I have searched, but I just can't find a way to match a string that is not surrounded by a given character, such as quotes or less-than/greater-than symbols.

A regex like this could match URLs not in HTML links, SQL table.column values not in quotes, and lots of other things.

Example with quotes: 
Match [THIS] and "something with [NOT THIS] followed by" or even [THIS].

Example with <, >, and ":
Match [URL] and <a href="[NOT URL]">or [NOT URL]</a>

Example with single quotes: 
WHERE [THIS] LIKE '%[NOT THIS]'

Basically, how do you match a string (THIS) when it is not surrounded by a given char?

\b(?:[^"'])([^"']+)(?:[^"'])\b

Here is a test pattern: a regex like what I am thinking of would match only the first "quote".

To quote, "quote me not lest I quote you!"

Xeoncross asked Jul 28 '09 00:07


2 Answers

The best solution will depend on what you know about the input. For example, if you're looking for things that aren't enclosed in double-quotes, does that mean double-quotes will always be properly balanced? Can they be escaped with backslashes, or by enclosing them in single-quotes?

Assuming the simplest case--no nesting, no escaping--you could use a lookahead like this:

preg_match('/THIS(?=(?:(?:[^"]*+"){2})*+[^"]*+\z)/', $subject)

After finding the target (THIS), the lookahead basically counts the double-quotes after that point until the end of the string. If there's an odd number of them, the match must have occurred inside a pair of double-quotes, so it's not valid (the lookahead fails).
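
As a rough illustration of the quote-counting lookahead, here is a minimal Python sketch. It assumes balanced, unescaped double-quotes; the name `find_unquoted` is hypothetical. The possessive quantifiers are left out because Python's `re` module only gained them in version 3.11; since `[^"]*` can never step over a quote, dropping them here affects speed, not which matches are found.

```python
import re

def find_unquoted(target, text):
    """Return start offsets of `target` that lie outside double-quoted runs."""
    pattern = re.compile(
        re.escape(target)
        # Lookahead: the rest of the string must hold an even number of quotes.
        + r'(?=(?:[^"]*"[^"]*")*[^"]*\Z)'
    )
    return [m.start() for m in pattern.finditer(text)]

sample = 'Match THIS and "something with NOT THIS followed by" or even THIS.'
print(find_unquoted("THIS", sample))
```

Only the first and last THIS are reported; the one inside the quoted run is followed by an odd number of quotes, so the lookahead fails there.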

As you've discovered, this problem is not well suited to regular expressions; that's why all of the proposed solutions depend on features that aren't found in real regular expressions, like capturing groups, lookarounds, reluctant and possessive quantifiers. I wouldn't even try this without possessive quantifiers or atomic groups.

EDIT: To expand this solution to account for double-quotes that can be escaped with backslashes, you just need to replace the parts of the regex that match "anything that's not a double-quote":

[^"]

with "anything that's not a quote or a backslash, or a backslash followed by anything":

(?:[^"\\]|\\.)

Since backslash-escape sequences are relatively rare, it's worthwhile to match as many unescaped characters as you can while you're in that part of the regex:

(?:[^"\\]++|\\.)

Putting it all together, the regex becomes:

'/THIS\d+(?=(?:(?:(?:[^"\\]++|\\.)*+"){2})*+(?:[^"\\]++|\\.)*+$)/'

Applied to your test string:

'Match THIS1 and "NOT THIS2" but THIS3 and "NOT "THIS4" or NOT THIS5" ' +
'but \"THIS6\" is good and \\\\"NOT THIS7\\\\".'

...it should match 'THIS1', 'THIS3', 'THIS4' and 'THIS6'.
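
The same check can be sketched in Python, under the assumption that the test string contains literal backslashes (hence the raw string literals below). The names `UNIT` and `PATTERN` are made up for this example. It uses the one-character-at-a-time form of the "unit" rather than `[^"\\]++`, because without possessive quantifiers a nested `+` inside `*` would invite heavy backtracking.

```python
import re

# One "unit": an ordinary character, or a backslash followed by any character.
UNIT = r'(?:[^"\\]|\\.)'

# After THISn, the rest of the string must contain an even number of
# unescaped double-quotes.
PATTERN = re.compile(
    r'THIS\d+(?=(?:' + UNIT + r'*"' + UNIT + r'*")*' + UNIT + r'*\Z)',
    re.DOTALL,  # let "." also cover a newline after a backslash
)

text = (
    r'Match THIS1 and "NOT THIS2" but THIS3 and "NOT "THIS4" or NOT THIS5" '
    r'but \"THIS6\" is good and \\"NOT THIS7\\".'
)
matches = PATTERN.findall(text)
print(matches)  # ['THIS1', 'THIS3', 'THIS4', 'THIS6']
```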

Alan Moore answered Nov 16 '22 02:11


It is a bit tough. There are ways, as long as you don't need to keep track of nesting. For instance, let's avoid quoted stuff:

^((?:[^"\\]|\\.|"(?:[^"\\]|\\.)*")*?)THIS

Or, explaining:

^     Match from the beginning
(     Store everything from the beginning in group 1, in case I want to do a replace
    (?:  Non-capturing group, just so I can repeat it
        [^"\\]  Anything but quote or escape character
        |       or...
        \\.     Any escaped character (\", for example)
        |       or...
        "       A quote, followed by...
        (?:     ...another non-capturing group, of...
            [^"\\]  Anything but quote or escape character
            |       or...
            \\.     Any escaped character
        )*      ...as many times as possible, followed by...
        "       A (closing) quote
    )*?  As many as necessary, but as few as possible
)     And this is the end of group 1
THIS  Followed by THIS
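
A minimal Python sketch of how group 1 might be used for a replacement follows. The helper name `replace_first_unquoted` is hypothetical, and `re.DOTALL` is an addition so that a backslash followed by a newline still counts as an escape. Because the pattern is anchored at the start, it only ever finds the first THIS outside quotes.

```python
import re

PREFIX_THEN_THIS = re.compile(
    r'^((?:[^"\\]|\\.|"(?:[^"\\]|\\.)*")*?)THIS',
    re.DOTALL,
)

def replace_first_unquoted(text, replacement):
    """Replace the first THIS outside double quotes; leave text alone if none."""
    # Group 1 holds everything before THIS, so put it back in front.
    return PREFIX_THEN_THIS.sub(lambda m: m.group(1) + replacement, text, count=1)

print(replace_first_unquoted('say "THIS" then THIS', 'THAT'))
# say "THIS" then THAT
```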

Now, there are other ways of doing this, though perhaps not as flexible. For instance, if you want to find THIS as long as there wasn't a preceding "//" or "#" sequence (in other words, a THIS outside a comment), you could do it like this:

(?<!(?:#|//).*)THIS

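Here, (?<!...) is a negative look-behind. It doesn't consume these characters, but it tests that they do not appear before THIS. Note that the .* gives this look-behind unbounded length, which only some engines (.NET, for example) support; PCRE and Python's re module reject it.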
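
For engines without unbounded look-behind, one possible workaround is to anchor at the start of each line and refuse to step over a comment marker. The Python sketch below is an assumption-laden substitute, not the look-behind itself: the names are made up, it treats every "#" or "//" as a comment start (even inside strings), and it reports only the first uncommented THIS on each line.

```python
import re

# From the start of a line, lazily consume characters that do not begin
# a "#" or "//" marker, then require THIS.
FIRST_UNCOMMENTED = re.compile(r'^((?:(?!#|//).)*?)THIS', re.MULTILINE)

def find_outside_comments(text):
    """Return the offset of the first uncommented THIS on each line, if any."""
    return [m.end(1) for m in FIRST_UNCOMMENTED.finditer(text)]

sample = "a THIS # THIS\n// THIS\nTHIS"
print(find_outside_comments(sample))
```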

As for arbitrarily nested structures (n opening parentheses closed by n closing ones, for example), they can't be represented by regular expressions. Perl can match them, but what it uses to do so is no longer a regular expression in the formal sense.
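
When nesting does matter, a small hand-written scanner with a depth counter is one alternative. This Python sketch is only an illustration under simple assumptions: the function name is hypothetical, parentheses are the only delimiters, and a stray ")" is ignored instead of being reported as an error.

```python
def find_at_depth_zero(text, target="THIS"):
    """Return offsets of `target` that are not inside any parentheses."""
    depth = 0
    hits = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)  # ignore unbalanced closers
        elif depth == 0 and text.startswith(target, i):
            hits.append(i)
            i += len(target)  # skip past the match
            continue
        i += 1
    return hits

print(find_at_depth_zero("THIS (a (THIS) THIS) THIS"))  # [0, 21]
```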

Daniel C. Sobral answered Nov 16 '22 00:11