
Regex and escaped and unescaped delimiter

question related to this

I have a string

a\;b\\;c;d

which in Java looks like

String s = "a\\;b\\\\;c;d";

I need to split it on semicolons according to the following rules:

  1. If a semicolon is preceded by a backslash, it should not be treated as a separator (between a and b).

  2. If the backslash is itself escaped, and therefore does not escape the semicolon, that semicolon should be a separator (between b and c).

So a semicolon should be treated as a separator if it is preceded by zero or an even number of backslashes.

For the example above, I want to get the following strings (backslashes must be doubled for the Java compiler):

a\;b\\
c
d
lstipakov asked Oct 26 '11 11:10



1 Answer

You can use the regex

(?:\\.|[^;\\]++)*

to match all text between unescaped semicolons:

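import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("(?:\\\\.|[^;\\\\]++)*");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
    // The pattern also matches the empty string at each separator; skip those (this drops empty fields too).
    if (!regexMatcher.group().isEmpty()) {
        matchList.add(regexMatcher.group());
    }
}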
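To try the regex on the string from the question, here is a minimal self-contained sketch. The class name RegexSplitDemo and the method name split are made up for illustration and are not part of the original answer. Because the pattern can match the empty string, the loop skips empty matches, which means genuinely empty fields (as in "a;;b") are dropped as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern itself comes from the answer.
class RegexSplitDemo {
    // An escaped pair (backslash + any char) or a run of ordinary characters,
    // repeated any number of times.
    private static final Pattern FIELD = Pattern.compile("(?:\\\\.|[^;\\\\]++)*");

    static List<String> split(String input) {
        List<String> parts = new ArrayList<>();
        Matcher m = FIELD.matcher(input);
        while (m.find()) {
            // The pattern also matches the empty string at each separator,
            // so keep only non-empty matches (empty fields are lost too).
            if (!m.group().isEmpty()) {
                parts.add(m.group());
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String s = "a\\;b\\\\;c;d"; // the text a\;b\\;c;d
        for (String part : split(s)) {
            System.out.println(part);
        }
    }
}
```

Run on the question's input, this prints the three strings the question asks for, with the escape sequences left in place.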

Explanation:

(?:        # Match either...
 \\.       # any escaped character
|          # or...
 [^;\\]++  # any character(s) except semicolon or backslash; possessive match
)*         # Repeat any number of times.

The possessive match (++) is important to avoid catastrophic backtracking because of the nested quantifiers.
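If backtracking is a concern, the "zero or even number of backslashes" rule from the question can also be written as a single left-to-right scan with no regex at all. This is only a sketch of an alternative, not the approach of the answer above; the names EscapeAwareSplitter and split are hypothetical. Unlike the regex loop, it keeps empty fields, and like it, it leaves the escape sequences in the output unchanged.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the parity rule with a one-pass scan.
class EscapeAwareSplitter {
    static List<String> split(String input) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        // True when the previous character was a backslash that is not itself escaped.
        boolean escaped = false;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (escaped) {
                // This character is escaped (including an escaped ';' or '\').
                current.append(c);
                escaped = false;
            } else if (c == '\\') {
                current.append(c);
                escaped = true;
            } else if (c == ';') {
                // Unescaped semicolon: end of the current field.
                parts.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        parts.add(current.toString()); // last field (may be empty)
        return parts;
    }
}
```

Tracking a single boolean is enough because each backslash either starts an escape or is consumed by one, so the flag is clear exactly when an even number of backslashes precedes the current character.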

Tim Pietzcker answered Sep 27 '22 19:09
