
Regex; backreferencing a character that was NOT matched in a character set

I want to construct a regex that matches either ' or ", then matches other characters, and ends when the same kind of quote (' or ", whichever was encountered at the start) is matched again. This seems simple enough to solve with a backreference at the end; here is the regex (it's in Java, so mind the extra escape characters such as the \ before the "):

private static String seekerTwo = "(['\"])([a-zA-Z])([a-zA-Z0-9():;/`\\=\\.\\,\\- ]+)(\\1)";

This code will successfully deal with things such as:

"hello my name is bob"
'i live in bethnal green'

The trouble comes when I have a String like this:

"hello this seat 'may be taken' already"

Using the above regex on it fails on the initial part: the attempt that starts at the opening " stops upon encountering the ', and the regex then goes on to successfully match 'may be taken'. This is obviously insufficient; I need the whole String to be matched.
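
A minimal sketch that reproduces this behaviour, assuming the pattern is applied with `Matcher.find()`. The class name `OriginalPatternDemo` and the helper `firstMatch` are hypothetical names used only for illustration; the pattern itself is copied unchanged from above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern string comes from the question.
class OriginalPatternDemo {
    static final Pattern SEEKER_TWO = Pattern.compile(
        "(['\"])([a-zA-Z])([a-zA-Z0-9():;/`\\=\\.\\,\\- ]+)(\\1)");

    // Returns the first substring the pattern finds, or null if there is none.
    static String firstMatch(String input) {
        Matcher m = SEEKER_TWO.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The whole string is matched when no other quote type appears inside.
        System.out.println(firstMatch("\"hello my name is bob\""));
        // Only the inner 'may be taken' is found here, because the character
        // class of the 3rd group contains neither ' nor ".
        System.out.println(firstMatch("\"hello this seat 'may be taken' already\""));
    }
}
```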

What I think I need is a way to allow the type of quotation mark that was NOT matched in the very first group, by including it as a character in the character set of the 3rd group. However, I know of no way to do this. Is there some sort of sneaky NOT-backreference construct, something I can use to reference the character from the 1st group's set that was NOT matched? Or otherwise some other solution to my predicament?

flamming_python asked Mar 15 '12 11:03


1 Answer

This can be done using negative lookahead assertions. The following solution even takes into account that you could escape a quote inside a string:

(["'])(?:\\.|(?!\1).)*\1

Explanation:

(["'])    # Match and remember a quote.
(?:       # Either match...
 \\.      # an escaped character
|         # or
 (?!\1)   # (unless that character is identical to the quote character in \1)
 .        # any character
)*        # any number of times.
\1        # Match the corresponding quote.

This correctly matches "hello this seat 'may be taken' already" or "hello this seat \"may be taken\" already".
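
As a quick check of the claim above, here is a small sketch using the compact, single-line form of the same pattern. The class name `QuotedStringDemo` and the helper `firstMatch` are hypothetical, and applying the pattern with `Matcher.find()` is an assumption about how it would be used.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class wrapping the pattern (["'])(?:\\.|(?!\1).)*\1
class QuotedStringDemo {
    // Every regex backslash is doubled for the Java string literal.
    static final Pattern QUOTED =
        Pattern.compile("([\"'])(?:\\\\.|(?!\\1).)*\\1");

    // Returns the first quoted string found, or null if there is none.
    static String firstMatch(String input) {
        Matcher m = QUOTED.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The other quote type inside the string no longer ends the match.
        System.out.println(firstMatch("\"hello this seat 'may be taken' already\""));
        // Backslash-escaped quotes of the same type are skipped as well.
        System.out.println(firstMatch("\"hello this seat \\\"may be taken\\\" already\""));
        // An opening quote without a matching close yields no match (null).
        System.out.println(firstMatch("\"unterminated 'string"));
    }
}
```

The order of the alternatives matters: `\\.` is tried first, so a backslash always consumes the character after it before the lookahead gets a chance to see an escaped quote.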

In Java, with all the backslashes:

Pattern regex = Pattern.compile(
    "([\"'])   # Match and remember a quote.\n" +
    "(?:       # Either match...\n" +
    " \\\\.    # an escaped character\n" +
    "|         # or\n" +
    " (?!\\1)  # (unless that character is identical to the matched quote char)\n" +
    " .        # any character\n" +
    ")*        # any number of times.\n" +
    "\\1       # Match the corresponding quote", 
    Pattern.COMMENTS);
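
A possible way to use the commented pattern, sketched under the assumption that you want every quoted string in a longer text. The class `QuotedStringFinder` and the method `findAll` are hypothetical names, not part of any library; only `Pattern`, `Matcher` and `Pattern.COMMENTS` are standard `java.util.regex` API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class built around the commented pattern above.
class QuotedStringFinder {
    static final Pattern REGEX = Pattern.compile(
        "([\"'])   # Match and remember a quote.\n" +
        "(?:       # Either match...\n" +
        " \\\\.    # an escaped character\n" +
        "|         # or\n" +
        " (?!\\1)  # (unless that character is identical to the matched quote char)\n" +
        " .        # any character\n" +
        ")*        # any number of times.\n" +
        "\\1       # Match the corresponding quote",
        Pattern.COMMENTS);

    // Collects every quoted string (including its quotes) in order of appearance.
    static List<String> findAll(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = REGEX.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Expected to list "it's fine" and 'he said "ok"' as two separate matches.
        for (String s : findAll("say \"it's fine\" and 'he said \"ok\"'")) {
            System.out.println(s);
        }
    }
}
```

Note that `.` does not match line terminators by default, so a quoted string spanning several lines would need `Pattern.DOTALL` in addition to `Pattern.COMMENTS`.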

Tim Pietzcker answered Sep 20 '22 06:09