I have the following code snippet:
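import java.util.regex.Pattern;

// The class name is arbitrary; it is only here so the snippet compiles on its own.
public class EscaperDemo {
    private static final Pattern ESCAPER_PATTERN = Pattern.compile("[^a-zA-Z0-9\\p{P}\\s]*");

    public static void main(String[] args) {
        String unaccentedText = "Aa123 \\/*-+.=+:/;.,?u%µ£$*^¨-)ac!e§('\"e&€#²³~´][{^";
        // Strip every character that is not an ASCII letter, a digit, punctuation, or whitespace.
        System.out.println(ESCAPER_PATTERN.matcher(unaccentedText).replaceAll(""));
    }
}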
When I execute this with JDK 7 the output I get is:
Aa123 \/*-.:/;.,?u%*-)ac!e('"e&#][{
When I execute the same with JDK 8 the output I get is:
Aa123 \/*-.:/;.,?u%*-)ac!e§('"e&#][{
Notice that the section sign § is not removed with JDK 8.
Which regex should I use with JDK 8 so that the section sign is matched as well, and what is the reason for this difference in behaviour between the JDKs?
The character U+00A7 SECTION SIGN was changed from category So (Symbol, Other) to category Po (Punctuation, Other) in Unicode 6.1.0:
UnicodeData.txt
- U+00A7, U+00B6, U+0F14, U+1360, and U+10102 were changed from gc=So to gc=Po.
Java 7 uses Unicode 6.0.0 and Java 8 updates to Unicode 6.2.0, which explains the difference in the result. Since § now belongs to the Punctuation category, it is matched by \p{P} in Java 8, so the negated class no longer removes it.
Regular punctuation marks such as !, #, ", ... also belong to the Po category, so we can't simply exclude that subcategory.
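To see which category the running JDK assigns to these characters, a small check along these lines can help. It is only a sketch: the class and method names are made up for illustration, and it relies on nothing beyond Character.getType and Pattern.matches. On Java 8 or later it should report § as Po and as matched by \p{P}; according to the explanation above, Java 7 would report it as a symbol instead.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question or answer.
final class SectionSignCategory {
    /** True if the running JDK puts the code point in general category Po. */
    static boolean isOtherPunctuation(int codePoint) {
        return Character.getType(codePoint) == Character.OTHER_PUNCTUATION;
    }

    /** True if the regex punctuation property matches the single code point. */
    static boolean matchesPunctuationClass(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return Pattern.matches("\\p{P}", s);
    }

    public static void main(String[] args) {
        // Section sign plus three ordinary punctuation marks for comparison.
        int[] samples = {0x00A7, '!', '#', '"'};
        for (int cp : samples) {
            System.out.printf("U+%04X Po=%b regexP=%b%n",
                    cp, isOtherPunctuation(cp), matchesPunctuationClass(cp));
        }
    }
}
```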
The next obvious solution is to use character set intersection to remove the unwanted character:
"[^a-zA-Z0-9\\p{P}\\s&&[^\u00a7]]"
... but wait a minute: Java has a bug with a negated character class nested inside another negated character class. The regex above compiles to:
[^a-zA-Z0-9\p{P}\s&&[^§]]
Start. Start unanchored match (minLength=1)
Pattern.intersection. S ∩ T:
Pattern.setDifference. S ∖ T:
Pattern.setDifference. S ∖ T:
Pattern.setDifference. S ∖ T:
Pattern.setDifference. S ∖ T:
CharProperty.complement. S̄:
Pattern.rangeFor. U+0061 <= codePoint <= U+007A.
Pattern.rangeFor. U+0041 <= codePoint <= U+005A.
Pattern.rangeFor. U+0030 <= codePoint <= U+0039.
DEBUG charProp: java.util.regex.Pattern$Category
Ctype. POSIX (US-ASCII): SPACE
CharProperty.complement. S̄:
BitClass. Match any of these 1 character(s):
§
java.util.regex.Pattern$LastNode
Node. Accept match
... which resolves to [^a-zA-Z0-9\p{P}\s] intersected with [^§], instead of the negation of ([a-zA-Z0-9\p{P}\s] intersected with [^§]).
To work around the bug above, the working solution is:
"[[^a-zA-Z0-9\\p{P}\\s]\u00a7]"
which compiles to:
[[^a-zA-Z0-9\p{P}\s]§]
Start. Start unanchored match (minLength=1)
Pattern.union. S ∪ T:
Pattern.setDifference. S ∖ T:
Pattern.setDifference. S ∖ T:
Pattern.setDifference. S ∖ T:
Pattern.setDifference. S ∖ T:
CharProperty.complement. S̄:
Pattern.rangeFor. U+0061 <= codePoint <= U+007A.
Pattern.rangeFor. U+0041 <= codePoint <= U+005A.
Pattern.rangeFor. U+0030 <= codePoint <= U+0039.
DEBUG charProp: java.util.regex.Pattern$Category
Ctype. POSIX (US-ASCII): SPACE
BitClass. Match any of these 1 character(s):
§
java.util.regex.Pattern$LastNode
Node. Accept match
The § is correctly included in the character class this time, so the sign will be removed.
Note that I have removed the quantifier for demonstration purposes. Please add a quantifier back to the character class in your code, preferably the one-or-more quantifier + rather than the zero-or-more quantifier * used in the question.
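Putting the workaround and the + quantifier together, the replacement could look roughly like the sketch below. The class name, method name, and the short sample string are illustrative assumptions; the sample uses Unicode escapes (\u00a7 for §, \u20ac for €) so the source file stays ASCII-only.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper around the workaround pattern, for illustration only.
final class SymbolStripper {
    // Union of "anything that is not an ASCII letter, digit, punctuation or
    // whitespace" with the section sign, repeated one or more times.
    private static final Pattern ESCAPER_PATTERN =
            Pattern.compile("[[^a-zA-Z0-9\\p{P}\\s]\u00a7]+");

    /** Removes every run of characters matched by the pattern. */
    static String strip(String text) {
        return ESCAPER_PATTERN.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Section sign, plus sign and euro sign should all be removed.
        String sample = "Aa1 \u00a7!+\u20ac?";
        System.out.println(strip(sample));
    }
}
```

With this sample, the letters, the digit, the space, ! and ? are kept, while §, + and € are dropped.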