Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

General approach for (equivalent of) "backreferences within character class"?

In Perl regexes, expressions like \1, \2, etc. are usually interpreted as "backreferences" to previously captured groups, but not so when the \1, \2, etc. appear within a character class. In the latter case, the \ is treated as an escape character (and therefore \1 is just 1, etc.).

Therefore, if (for example) one wanted to match a string (of length greater than 1) whose first character matches its last character, but does not appear anywhere else in the string, the following regex will not do:

/\A       # match beginning of string;
 (.)      # match and capture first character (referred to subsequently by \1);
 [^\1]*   # (WRONG) match zero or more characters different from character in \1;
 \1       # match \1;
 \z       # match the end of the string;
/sx       # s: let . match newline; x: ignore whitespace, allow comments

would not work, since it matches (for example) the string 'a1a2a':

  DB<1> ( 'a1a2a' =~ /\A(.)[^\1]*\1\z/ and print "fail!" ) or print "success!"
fail!

I can usually manage to find some workaround1, but it's always rather problem-specific, and usually far more complicated-looking than what I would do if I could use backreferences within a character class.

Is there a general (and hopefully straightforward) workaround?


1 For example, for the problem in the example above, I'd use something like

/\A
 (.)              # match and capture first character (referred to subsequently
                  # by \1);
 (?!.*\1\.+\z)    # a negative lookahead assertion for "a suffix containing \1";
 .*               # substring not containing \1 (as guaranteed by the preceding
                  # negative lookahead assertion);
 \1\z             # match last character only if it is equal to the first one
/sx

...where I've replaced the reasonably straightforward (though, alas, incorrect) subexpression [^\1]* in the earlier regex with the somewhat more forbidding negative lookahead assertion (?!.*\1.+\z). This assertion basically says "give up if \1 appears anywhere beyond this point (other than at the last position)." Incidentally, I give this solution just to illustrate the sort of workarounds I referred to in the question. I don't claim that it is a particularly good one.

like image 562
kjo Avatar asked Aug 14 '13 20:08

kjo


1 Answers

This can be accomplished with a negative lookahead within a repeated group:

/\A         # match beginning of string;
 (.)        # match and capture first character (referred to subsequently by \1);
 ((?!\1).)* # match zero or more characters different from character in \1;
 \1         # match \1;
 \z         # match the end of the string;
/sx

This pattern can be used even if the group contains more than one character.

like image 175
Andrew Clark Avatar answered Oct 06 '22 23:10

Andrew Clark