Pattern.matcher in java 8 does not match section sign §

Question

I have the following code snippet:

private static final Pattern ESCAPER_PATTERN = Pattern.compile("[^a-zA-Z0-9\\p{P}\\s]*");

/**
 * @param args
 */
public static void main(String[] args)
{
    String unaccentedText = "Aa123 /*-+.=+:/;.,?u%µ£$*^¨-)ac!e§('\"e&€#²³~´][{^";
    System.out.println(ESCAPER_PATTERN.matcher(unaccentedText).replaceAll(""));
}

When I execute this with JDK 7 the output I get is:

Aa123 /*-.:/;.,?u%*-)ac!e('"e&#][{

When I execute the same with JDK 8 the output I get is:

Aa123 /*-.:/;.,?u%*-)ac!e§('"e&#][{

Notice that the section sign § is not removed with JDK 8.

Please let me know which regex to use with JDK 8 so that the section sign is matched as well, and why the behaviour differs between the two JDKs.

nhahtdh · Accepted Answer

Unicode moved your cheese

The character U+00A7 SECTION SIGN was changed from category So (Symbol, Other) to category Po (Punctuation, Other) in Unicode 6.1.0:

UnicodeData.txt

U+00A7, U+00B6, U+0F14, U+1360, and U+10102 were changed from gc=So to gc=Po.

Java 7 uses Unicode 6.0.0 and Java 8 updates to Unicode 6.2.0, which explains the difference in the result. Since § now belongs to the Punctuation category, it is matched by \p{P} in Java 8, so the negated character class in the question no longer matches it and it is left in the output.
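To check the category change on whichever JVM you are running, a small probe like the one below can help. It is only a sketch: the class and method names are made up for illustration, and it relies on nothing beyond Character.getType and Pattern.matches from the standard library.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question or answer.
class SectionSignCategory {
    // True if the running JVM classifies the code point as Po (Punctuation, Other).
    static boolean isOtherPunctuation(int codePoint) {
        return Character.getType(codePoint) == Character.OTHER_PUNCTUATION;
    }

    // True if \p{P} matches the single code point on the running JVM.
    static boolean matchesPunctuationClass(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return Pattern.matches("\\p{P}", s);
    }

    public static void main(String[] args) {
        int sectionSign = 0x00A7; // U+00A7 SECTION SIGN
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("category is Po: " + isOtherPunctuation(sectionSign));
        System.out.println("matched by P:   " + matchesPunctuationClass(sectionSign));
    }
}
```

On Java 7 both checks should report false for the section sign, and on Java 8 both should report true, in line with the Unicode versions mentioned above.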

Wrong solution

Since regular punctuation marks such as !, #, ", ... also belong to the Po category, we can't simply drop this subcategory from the character class.

The next obvious solution is to use character set intersection to remove the unwanted character:

"[^a-zA-Z0-9\\p{P}\\s&&[^\u00a7]]"

... but wait a minute: Java has a bug with a negated character class nested inside a negated character class. The regex above compiles to:

[^a-zA-Z0-9\p{P}\s&&[^§]]
Start. Start unanchored match (minLength=1)
Pattern.intersection. S ∩ T:
  Pattern.setDifference. S ∖ T:
    Pattern.setDifference. S ∖ T:
      Pattern.setDifference. S ∖ T:
        Pattern.setDifference. S ∖ T:
          CharProperty.complement. S̄:
            Pattern.rangeFor. U+0061 <= codePoint <= U+007A.
          Pattern.rangeFor. U+0041 <= codePoint <= U+005A.
        Pattern.rangeFor. U+0030 <= codePoint <= U+0039.
      DEBUG charProp: java.util.regex.Pattern$Category
    Ctype. POSIX (US-ASCII): SPACE
  CharProperty.complement. S̄:
    BitClass. Match any of these 1 character(s):
      §
java.util.regex.Pattern$LastNode
Node. Accept match

... which resolves to [^a-zA-Z0-9\p{P}\s] intersected with [^§], instead of the negation of ([a-zA-Z0-9\p{P}\s] intersected with [^§]). The resulting class therefore never matches §.
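If you want to see how your own JDK treats this intersection pattern, you can probe it directly. The sketch below (class and method names are hypothetical) deliberately makes no assumption about whether § matches, since that is exactly the part that depends on how the JDK parses the negation; the other two probes should give the same answer under either reading.

```java
import java.util.regex.Pattern;

// Hypothetical probe class, not part of the original answer.
class IntersectionProbe {
    // The "wrong solution" pattern from above, written as a Java string literal.
    static final Pattern INTERSECTION =
            Pattern.compile("[^a-zA-Z0-9\\p{P}\\s&&[^\u00a7]]");

    // True if the whole input is matched by the character class.
    static boolean matches(String s) {
        return INTERSECTION.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println("java.version = " + System.getProperty("java.version"));
        // Result depends on how the running JDK parses the negation.
        System.out.println("section sign matched: " + matches("\u00a7"));
        // A symbol outside the kept set: expected to match under either reading.
        System.out.println("plus sign matched:    " + matches("+"));
        // A letter inside the kept set: expected not to match under either reading.
        System.out.println("letter a matched:     " + matches("a"));
    }
}
```

If the first line reports false, your JDK shows the behaviour described in this answer and the workaround is needed.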

Correct solution

To work around the bug above, a working solution is:

"[[^a-zA-Z0-9\\p{P}\\s]\u00a7]"

which compiles to:

[[^a-zA-Z0-9\p{P}\s]§]
Start. Start unanchored match (minLength=1)
Pattern.union. S ∪ T:
  Pattern.setDifference. S ∖ T:
    Pattern.setDifference. S ∖ T:
      Pattern.setDifference. S ∖ T:
        Pattern.setDifference. S ∖ T:
          CharProperty.complement. S̄:
            Pattern.rangeFor. U+0061 <= codePoint <= U+007A.
          Pattern.rangeFor. U+0041 <= codePoint <= U+005A.
        Pattern.rangeFor. U+0030 <= codePoint <= U+0039.
      DEBUG charProp: java.util.regex.Pattern$Category
    Ctype. POSIX (US-ASCII): SPACE
  BitClass. Match any of these 1 character(s):
    §
java.util.regex.Pattern$LastNode
Node. Accept match

The § is correctly included in the character class this time, so the sign will be removed.

Note that I have removed the quantifier for demonstration purposes. Please add a quantifier back to the character class in your code, preferably the one-or-more quantifier + rather than the zero-or-more quantifier * used in the question.
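Putting the pieces together, a minimal sketch of the corrected code from the question might look like this. The class name and the strip method are invented for the example; the non-ASCII characters of the sample string are written as Unicode escapes only so that the result does not depend on the source file encoding, and the stray backslash before the first / is left out because it does not appear in either output shown in the question.

```java
import java.util.regex.Pattern;

// Hypothetical class name; the pattern is the union workaround from this answer
// with the one-or-more quantifier added back.
class SectionSignEscaper {
    static final Pattern ESCAPER_PATTERN =
            Pattern.compile("[[^a-zA-Z0-9\\p{P}\\s]\u00a7]+");

    // Sample text from the question, with non-ASCII characters escaped.
    static final String SAMPLE =
            "Aa123 /*-+.=+:/;.,?u%\u00b5\u00a3$*^\u00a8-)ac!e\u00a7('\"e&\u20ac#\u00b2\u00b3~\u00b4][{^";

    // Removes every run of characters matched by ESCAPER_PATTERN.
    static String strip(String text) {
        return ESCAPER_PATTERN.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip(SAMPLE));
    }
}
```

On Java 8 this should print the same text as the JDK 7 output in the question, with the section sign removed.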

Tags: regex, java-8

Priya Gachinamath
