

Bug in double negation of regex character classes?

Update: In Java 11 the bug described below seems to be fixed.

(It was possibly fixed even earlier, but I don't know in exactly which version. The bug report about a similar problem linked in nhahtdh's answer suggests Java 9.)

TL;DR (before the fix):
Why do [^\\D2], [^[^0-9]2], and [^2[^0-9]] give different results in Java?

Code used for tests. You can skip it for now.

    String[] regexes = { "[[^0-9]2]", "[\\D2]", "[013-9]", "[^\\D2]", "[^[^0-9]2]", "[^2[^0-9]]" };
    String[] tests = { "x", "1", "2", "3", "^", "[", "]" };

    System.out.printf("match | %9s , %6s | %6s , %6s , %6s , %10s%n", (Object[]) regexes);
    System.out.println("-----------------------------------------------------------------------");
    for (String test : tests)
        System.out.printf("%5s | %9b , %6b | %7b , %6b , %10b , %10b %n", test,
                test.matches(regexes[0]), test.matches(regexes[1]),
                test.matches(regexes[2]), test.matches(regexes[3]),
                test.matches(regexes[4]), test.matches(regexes[5]));

Let's say I need a regex which will accept characters that are

- not digits,
- with the exception of 2.

So such a regex should represent every character except 0, 1, 3, 4, ..., 9. I can write it in at least two ways, each being the union of everything that is not a digit with 2:

[[^0-9]2]
[\\D2]

Both of these regexes work as expected:

    match , [[^0-9]2] ,  [\D2]
    --------------------------
        x ,      true ,   true
        1 ,     false ,  false
        2 ,      true ,   true
        3 ,     false ,  false
        ^ ,      true ,   true
        [ ,      true ,   true
        ] ,      true ,   true

Now let's say I want to reverse the accepted characters (so I want to accept all digits except 2). I could create a regex which explicitly contains all accepted characters, like

[013-9]

or try to negate the two previously described regexes by wrapping them in another [^...], like

[^\\D2]
[^[^0-9]2] or even
[^2[^0-9]]

but to my surprise only the first two versions ([013-9] and [^\D2]) work as expected:

    match | [[^0-9]2] ,  [\D2] | [013-9] , [^\D2] , [^[^0-9]2] , [^2[^0-9]]
    ------+--------------------+-------------------------------------------
        x |      true ,   true |   false ,  false ,       true ,       true
        1 |     false ,  false |    true ,   true ,      false ,       true
        2 |      true ,   true |   false ,  false ,      false ,      false
        3 |     false ,  false |    true ,   true ,      false ,       true
        ^ |      true ,   true |   false ,  false ,       true ,       true
        [ |      true ,   true |   false ,  false ,       true ,       true
        ] |      true ,   true |   false ,  false ,       true ,       true

So my question is: why don't [^[^0-9]2] and [^2[^0-9]] behave like [^\D2]? Can I somehow correct these regexes so that I can use [^0-9] inside them?


asked Feb 21 '14 12:02

Pshemo

2 Answers

There is some strange voodoo going on in the character class parsing code of Oracle's implementation of the Pattern class, which comes with your JRE/JDK if you downloaded it from Oracle's website or if you are using OpenJDK. I have not checked how other JVM implementations (notably GNU Classpath) parse the regexes in the question.

From this point on, any reference to the Pattern class and its internal workings is strictly restricted to Oracle's implementation (the reference implementation).

It would take some time to read and understand how the Pattern class parses the nested negation shown in the question. However, I have written a program¹ that extracts information from a Pattern object (with the Reflection API) so that the result of compilation can be inspected. The output below is from running my program on Java HotSpot Client VM version 1.7.0_51.

¹ Currently, the program is an embarrassing mess. I will update this post with a link once I have finished and refactored it.

    [^0-9]
    Start. Start unanchored match (minLength=1)
    CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
      Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
    LastNode
    Node. Accept match

Nothing surprising here.

    [^[^0-9]]
    Start. Start unanchored match (minLength=1)
    CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
      Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
    LastNode
    Node. Accept match

    [^[^[^0-9]]]
    Start. Start unanchored match (minLength=1)
    CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
      Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
    LastNode
    Node. Accept match

The 2 cases above are compiled to the same program as [^0-9], which is counter-intuitive.

    [[^0-9]2]
    Start. Start unanchored match (minLength=1)
    Pattern.union (character class union). Match any character matched by either character classes below:
      CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
        Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
      BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
        [U+0032]
        2
    LastNode
    Node. Accept match

    [\D2]
    Start. Start unanchored match (minLength=1)
    Pattern.union (character class union). Match any character matched by either character classes below:
      CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
        Ctype. Match POSIX character class DIGIT (US-ASCII)
      BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
        [U+0032]
        2
    LastNode
    Node. Accept match

Nothing strange in the 2 cases above, as stated in the question.

    [013-9]
    Start. Start unanchored match (minLength=1)
    Pattern.union (character class union). Match any character matched by either character classes below:
      BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 2 character(s):
        [U+0030][U+0031]
        01
      Pattern.rangeFor (character range). Match any character within the range from code point U+0033 to code point U+0039 (both ends inclusive)
    LastNode
    Node. Accept match

    [^\D2]
    Start. Start unanchored match (minLength=1)
    Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
      CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
        CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
          Ctype. Match POSIX character class DIGIT (US-ASCII)
      BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
        [U+0032]
        2
    LastNode
    Node. Accept match

These 2 cases work as expected, as stated in the question. However, take note of how the engine takes the complement of the first character class (\D) and then applies set difference with the character class consisting of the leftover.

    [^[^0-9]2]
    Start. Start unanchored match (minLength=1)
    Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
      CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
        Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
      BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
        [U+0032]
        2
    LastNode
    Node. Accept match

    [^[^[^0-9]]2]
    Start. Start unanchored match (minLength=1)
    Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
      CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
        Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
      BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
        [U+0032]
        2
    LastNode
    Node. Accept match

    [^[^[^[^0-9]]]2]
    Start. Start unanchored match (minLength=1)
    Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
      CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
        Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
      BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
        [U+0032]
        2
    LastNode
    Node. Accept match

As confirmed via testing by Keppil in the comments, the output above shows that all 3 regexes above are compiled to the same program!

    [^2[^0-9]]
    Start. Start unanchored match (minLength=1)
    Pattern.union (character class union). Match any character matched by either character classes below:
      CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
        BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
          [U+0032]
          2
      CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
        Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
    LastNode
    Node. Accept match

Instead of NOT(UNION(2, NOT(0-9))), which is [013-9], we get UNION(NOT(2), NOT(0-9)), which is equivalent to NOT(2).
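The difference between the two compositions can be checked without the regex engine at all. The following is only a sketch that models character classes as `IntPredicate`s; the class name `NegationOrderDemo` and its fields are made up for illustration and are not part of `Pattern`.

```java
import java.util.function.IntPredicate;

// Hypothetical model: a character class is a predicate over code points.
class NegationOrderDemo {
    static final IntPredicate TWO = c -> c == '2';
    static final IntPredicate NOT_DIGIT = c -> c < '0' || c > '9';

    // What [^2[^0-9]] is meant to be: NOT(UNION(2, NOT(0-9)))
    static final IntPredicate INTENDED = TWO.or(NOT_DIGIT).negate();

    // What the dump above shows: UNION(NOT(2), NOT(0-9))
    static final IntPredicate COMPILED = TWO.negate().or(NOT_DIGIT);

    public static void main(String[] args) {
        for (char c : "x123".toCharArray()) {
            System.out.printf("%c intended=%b compiled=%b%n",
                    c, INTENDED.test(c), COMPILED.test(c));
        }
    }
}
```

The two predicates agree on 2 but disagree on x: the intended class rejects it, while the compiled one accepts every character other than 2.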

[^2[^[^0-9]]] Start. Start unanchored match (minLength=1) Pattern.union (character class union). Match any character matched by either character classes below:   CharProperty.complement (character class negation). Match any character NOT matched by the following character class:     BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):       [U+0032]       2   CharProperty.complement (character class negation). Match any character NOT matched by the following character class:     Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive) LastNode Node. Accept match

The regex [^2[^[^0-9]]] compiles to the same program as [^2[^0-9]] due to the same bug.

There is an unresolved bug that seems to be of the same nature: JDK-6609854.
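As for the question of how to keep [^0-9]-style pieces while avoiding nested negation at the top level, one option that should not depend on this behaviour is the intersection form that the Pattern documentation lists as the subtraction idiom (its example is [a-z&&[^bc]]). The sketch below applies that idiom to "all digits except 2"; the class name `DigitsExceptTwo` and the helper `accepts` are made up for illustration.

```java
import java.util.regex.Pattern;

class DigitsExceptTwo {
    // Intersection of 0-9 with "anything but 2", following the documented
    // [a-z&&[^bc]] subtraction idiom.
    static final Pattern DIGIT_NOT_TWO = Pattern.compile("[0-9&&[^2]]");

    // Hypothetical helper: true if the whole input is one accepted character.
    static boolean accepts(String input) {
        return DIGIT_NOT_TWO.matcher(input).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] { "x", "1", "2", "3" }) {
            System.out.println(s + " -> " + accepts(s));
        }
    }
}
```

Here the negation applies only to the innermost class, so the outer class never has to negate a nested class.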

Explanation

Preliminary

Below are implementation details of the Pattern class that one should know before reading further:

- The Pattern class compiles a String into a chain of nodes. Each node has a small, well-defined responsibility and delegates the rest of the work to the next node in the chain. The Node class is the base class of all the nodes.
- The CharProperty class is the base class of all character-class related Nodes.
- The BitClass class is a subclass of CharProperty that uses a boolean[] array to speed up matching for Latin-1 characters (code point <= 255). It has an add method, which allows characters to be added during compilation.
- CharProperty.complement, Pattern.union and Pattern.intersection are methods corresponding to set operations. What they do is self-explanatory.
- Pattern.setDifference is asymmetric set difference.
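To make the BitClass idea concrete, here is a minimal sketch of a boolean[]-backed Latin-1 lookup. It is not the JDK source: the class name `Latin1BitSet` and its methods are hypothetical, and the real class does more (it is a node in the matching chain).

```java
// Hypothetical, simplified stand-in for the BitClass idea:
// one boolean per Latin-1 code point, filled in during compilation.
class Latin1BitSet {
    private final boolean[] bits = new boolean[256];

    // Adds a character to the class; only Latin-1 code points fit.
    Latin1BitSet add(int codePoint) {
        if (codePoint < 0 || codePoint > 255) {
            throw new IllegalArgumentException("not a Latin-1 code point: " + codePoint);
        }
        bits[codePoint] = true;
        return this;
    }

    // Constant-time membership test; anything outside Latin-1 never matches.
    boolean matches(int codePoint) {
        return codePoint >= 0 && codePoint < 256 && bits[codePoint];
    }
}
```

For example, `new Latin1BitSet().add('2')` accepts '2' and rejects every other character, which corresponds to the "Match the following 1 character(s)" entries in the dumps above.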

Parsing character class at first glance

Before looking at the full code of CharProperty clazz(boolean consume) method, which is the method responsible for parsing a character class, let us look at an extremely simplified version of the code to understand the flow of the code:

    private CharProperty clazz(boolean consume) {
        // [Declaration and initialization of local variables - OMITTED]
        BitClass bits = new BitClass();
        int ch = next();
        for (;;) {
            switch (ch) {
                case '^':
                    // Negates if first char in a class, otherwise literal
                    if (firstInClass) {
                        // [CODE OMITTED]
                        ch = next();
                        continue;
                    } else {
                        // ^ not first in class, treat as literal
                        break;
                    }
                case '[':
                    // [CODE OMITTED]
                    ch = peek();
                    continue;
                case '&':
                    // [CODE OMITTED]
                    continue;
                case 0:
                    // [CODE OMITTED]
                    // Unclosed character class is checked here
                    break;
                case ']':
                    // [CODE OMITTED]
                    // The only return statement in this method
                    // is in this case
                    break;
                default:
                    // [CODE OMITTED]
                    break;
            }
            node = range(bits);

            // [CODE OMITTED]
            ch = peek();
        }
    }

The code basically reads the input (the input String converted to null-terminated int[] of code points) until it hits ] or the end of the String (unclosed character class).

The code is a bit confusing with continue and break mixing together inside the switch block. However, as long as you realize that continue belongs to the outer for loop and break belongs to the switch block, the code is easy to understand:

Cases ending in continue will never execute the code after the switch statement.
Cases ending in break may execute the code after the switch statement (if it doesn't return already).
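The difference between the two kinds of cases can be seen in a small stand-alone sketch. The class SwitchFlowSketch and its trace method are hypothetical and only mimic the loop-plus-switch shape of clazz; they do not parse anything.

```java
public class SwitchFlowSketch {
    // Returns the characters that reached the code after the switch.
    public static String trace(String input) {
        StringBuilder afterSwitch = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            switch (ch) {
                case '[':
                    // continue targets the enclosing for loop,
                    // so the code after the switch is skipped.
                    continue;
                default:
                    // break only leaves the switch,
                    // so execution proceeds to the code after it.
                    break;
            }
            // Only reached by cases that ended in break.
            afterSwitch.append(ch);
        }
        return afterSwitch.toString();
    }

    public static void main(String[] args) {
        System.out.println(trace("a[b[c")); // abc
    }
}
```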

With the observation above, we can see that whenever a character is found to be non-special and should be included in the character class, we will execute the code after the switch statement, in which node = range(bits); is the first statement.

If you check the source code, the method CharProperty range(BitClass bits) parses "a single character or a character range in a character class". The method either returns the same BitClass object passed in (with the new character added) or returns a new instance of the CharProperty class.

The gory details

Next, let us look at the full version of the code (with the part parsing character class intersection && omitted):

private CharProperty clazz(boolean consume) {
    CharProperty prev = null;
    CharProperty node = null;
    BitClass bits = new BitClass();
    boolean include = true;
    boolean firstInClass = true;
    int ch = next();
    for (;;) {
        switch (ch) {
            case '^':
                // Negates if first char in a class, otherwise literal
                if (firstInClass) {
                    if (temp[cursor-1] != '[')
                        break;
                    ch = next();
                    include = !include;
                    continue;
                } else {
                    // ^ not first in class, treat as literal
                    break;
                }
            case '[':
                firstInClass = false;
                node = clazz(true);
                if (prev == null)
                    prev = node;
                else
                    prev = union(prev, node);
                ch = peek();
                continue;
            case '&':
                // [CODE OMITTED]
                // There are interesting things (bugs) here,
                // but it is not relevant to the discussion.
                continue;
            case 0:
                firstInClass = false;
                if (cursor >= patternLength)
                    throw error("Unclosed character class");
                break;
            case ']':
                firstInClass = false;

                if (prev != null) {
                    if (consume)
                        next();

                    return prev;
                }
                break;
            default:
                firstInClass = false;
                break;
        }
        node = range(bits);

        if (include) {
            if (prev == null) {
                prev = node;
            } else {
                if (prev != node)
                    prev = union(prev, node);
            }
        } else {
            if (prev == null) {
                prev = node.complement();
            } else {
                if (prev != node)
                    prev = setDifference(prev, node);
            }
        }
        ch = peek();
    }
}

Looking at the code in case '[': of the switch statement and the code after the switch statement:

The node variable stores the result of parsing a unit (a standalone character, a character range, a shorthand character class, a POSIX/Unicode character class or a nested character class)
The prev variable stores the compilation result so far, and is always updated right after we compile a unit in node.

Since the local variable boolean include, which records whether the character class is negated, is never passed to any method call, it can only be acted upon in this method alone. And the only place include is read and processed is after the switch statement.
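Applying that observation by hand to the two patterns from the question gives the sketch below. It is not the JDK code: it is a manual trace of the listing above, with sets modelled as IntPredicate, and the class and method names are hypothetical.

```java
import java.util.function.IntPredicate;

public class NegatedNestingTrace {
    // Result of parsing the inner class [^0-9] on its own.
    public static final IntPredicate NON_DIGIT = c -> !(c >= '0' && c <= '9');
    // The single-character unit '2'.
    public static final IntPredicate TWO = c -> c == '2';

    // [^[^0-9]2]: the nested class comes first, and case '[' never reads
    // include, so prev = [^0-9]. Then '2' arrives with include == false and
    // prev != null, so prev = setDifference(prev, {2}).
    public static IntPredicate nestedFirst() {
        IntPredicate prev = NON_DIGIT;
        return prev.and(TWO.negate());
    }

    // [^2[^0-9]]: '2' arrives first with include == false and prev == null,
    // so prev = complement({2}). Then the nested class is merged with
    // union(prev, node), again without reading include.
    public static IntPredicate nestedLast() {
        IntPredicate prev = TWO.negate();
        return prev.or(NON_DIGIT);
    }

    public static void main(String[] args) {
        for (char c : new char[] {'x', '1', '2', '3'}) {
            System.out.printf("%c | %b , %b%n",
                    c, nestedFirst().test(c), nestedLast().test(c));
        }
    }
}
```

Under this reading, [^[^0-9]2] reduces to [^0-9] and [^2[^0-9]] reduces to [^2], which agrees with the last two columns of the table in the question for x, 1, 2 and 3.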

Post under construction


answered Sep 29 '22 22:09

nhahtdh

According to the JavaDoc page, nesting classes produces the union of the two classes, which makes it impossible to create an intersection using that notation:

To create a union, simply nest one class inside the other, such as [0-4[6-8]]. This particular union creates a single character class that matches the numbers 0, 1, 2, 3, 4, 6, 7, and 8.

To create an intersection you will have to use &&:

To create a single character class matching only the characters common to all of its nested classes, use &&, as in [0-9&&[345]]. This particular intersection creates a single character class matching only the numbers common to both character classes: 3, 4, and 5.
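The two documented examples can be checked directly with String.matches. The sketch below also tries [0-9&&[^2]], which follows the subtraction form shown in the same documentation; treating it as a replacement for the patterns in the question is my assumption, not something the documentation states. The class name and the matches helper are hypothetical.

```java
public class UnionIntersectionCheck {
    // Thin wrapper so each check reads as (regex, input).
    public static boolean matches(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        // Union by nesting: 0-4 together with 6-8.
        System.out.println(matches("[0-4[6-8]]", "5"));   // false
        System.out.println(matches("[0-4[6-8]]", "6"));   // true

        // Intersection with &&: only 3, 4 and 5 remain.
        System.out.println(matches("[0-9&&[345]]", "4")); // true
        System.out.println(matches("[0-9&&[345]]", "6")); // false

        // Digits except 2, written as an intersection with a negated class.
        System.out.println(matches("[0-9&&[^2]]", "1"));  // true
        System.out.println(matches("[0-9&&[^2]]", "2"));  // false
        System.out.println(matches("[0-9&&[^2]]", "x"));  // false
    }
}
```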

The last part of your problem is still a mystery to me too. The union of [^2] and [^0-9] should indeed be [^2], so [^2[^0-9]] behaves as expected. [^[^0-9]2] behaving like [^0-9] is indeed strange though.

answered Sep 29 '22 21:09

Keppil
