
Undocumented Java regex character class: \p{C}

I found an interesting regex in a Java project: "[\\p{C}&&\\S]"

I understand that the && means "set intersection", and \S is "non-whitespace", but what is \p{C}, and is it okay to use?

The java.util.regex.Pattern documentation doesn't mention it. The only similar class on the list is \p{Cntrl}, but the two behave differently: both match control characters, but \p{C} matches twice on code points above U+FFFF, such as U+1F4A9 PILE OF POO:

public class StrangePattern {
    public static void main(String[] argv) {

        // As far as I can tell, this is the simplest way to create a String
        // with code points above U+FFFF.
        String poo = new String(Character.toChars(0x1F4A9));

        System.out.println(poo);  // prints `💩`
        System.out.println(poo.replaceAll("\\p{C}", "?"));  // prints `??`
        System.out.println(poo.replaceAll("\\p{Cntrl}", "?"));  // prints `💩`
    }
}

The only mention I've found anywhere is here:

\p{C} or \p{Other}: invisible control characters and unused code points.

However, \p{Other} does not seem to exist in Java, and the matching code points are not unused.

My Java version info:

$ java -version
java version "1.8.0_92"
Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.92-b14, mixed mode)

Bonus question: what is the likely intent of the original pattern, "[\\p{C}&&\\S]"? It occurs in a method which validates a string before it is sent in an email: if that pattern is matched, an exception with the message "Invalid string" is raised.
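For reference, here is a minimal sketch of what that pattern accepts and rejects on a few BMP-only inputs. The class name, the helper `containsSuspectChar`, and the sample strings are hypothetical and not taken from the original project; the sketch only shows which characters fall in the intersection and does not establish what the original author intended.

```java
import java.util.regex.Pattern;

public class InvalidStringCheck {
    // The pattern from the question: characters in category C that are not whitespace.
    private static final Pattern SUSPECT = Pattern.compile("[\\p{C}&&\\S]");

    // Hypothetical helper: true if the input contains a character the pattern matches.
    static boolean containsSuspectChar(String s) {
        return SUSPECT.matcher(s).find();
    }

    public static void main(String[] args) {
        // Ordinary text ending in a newline: newline is Cc, but it is whitespace.
        System.out.println(containsSuspectChar("Hello, world\n"));
        // Tab is Cc, but it is whitespace as well.
        System.out.println(containsSuspectChar("tab\tseparated"));
        // U+0000 is Cc and is not whitespace.
        System.out.println(containsSuspectChar("nul\u0000byte"));
        // U+200B ZERO WIDTH SPACE is Cf and is not matched by \s.
        System.out.println(containsSuspectChar("zero\u200Bwidth"));
    }
}
```

On these inputs the first two calls return false and the last two return true, which is consistent with a check that tolerates tabs and line breaks but flags other control or format characters.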

asked May 17 '17 20:05 by doctaphred



1 Answer

Buried down in the Pattern docs under Unicode Support, we find the following:

This class is in conformance with Level 1 of Unicode Technical Standard #18: Unicode Regular Expression, plus RL2.1 Canonical Equivalents.

...

Categories may be specified with the optional prefix Is: Both \p{L} and \p{IsL} denote the category of Unicode letters. Same as scripts and blocks, categories can also be specified by using the keyword general_category (or its short form gc) as in general_category=Lu or gc=Lu.

The supported categories are those of The Unicode Standard in the version specified by the Character class. The category names are those defined in the Standard, both normative and informative.
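The equivalent spellings described in that passage can be checked with a short sketch. The class name and helper are made up for illustration, and Lu is used because it is the category the documentation itself uses as an example.

```java
import java.util.regex.Pattern;

public class CategorySpellings {
    // Returns true if the whole input matches the given regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Four ways of naming the same general category (uppercase letters).
        String[] spellings = {
            "\\p{Lu}",
            "\\p{IsLu}",
            "\\p{gc=Lu}",
            "\\p{general_category=Lu}",
        };
        for (String regex : spellings) {
            System.out.println(regex
                    + " matches \"A\": " + matches(regex, "A")
                    + ", matches \"a\": " + matches(regex, "a"));
        }
    }
}
```

Each spelling should match "A" and reject "a".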

From Unicode Technical Standard #18, we find that C is defined to match any Other General_Category value, and that support for this is part of the requirements for Level 1 conformance. Java implements \p{C} because it claims conformance to Level 1 of UTS #18.
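To make the difference from \p{Cntrl} concrete: in the Unicode Standard, C groups the subcategories Cc (control), Cf (format), Cs (surrogate), Co (private use) and Cn (unassigned), whereas \p{Cntrl} is by default the ASCII-only POSIX class. The sketch below compares the two on a few single BMP characters; the class and helper names are hypothetical, and unassigned code points are left out because their handling may vary between Java versions.

```java
import java.util.regex.Pattern;

public class OtherVersusCntrl {
    // True if the single-character string is matched by \p{C}.
    static boolean isOther(String s) {
        return Pattern.matches("\\p{C}", s);
    }

    // True if the single-character string is matched by \p{Cntrl}
    // (ASCII control characters only, unless UNICODE_CHARACTER_CLASS is set).
    static boolean isCntrl(String s) {
        return Pattern.matches("\\p{Cntrl}", s);
    }

    public static void main(String[] args) {
        String[][] samples = {
            {"U+0007 BELL (Cc)", "\u0007"},
            {"U+200B ZERO WIDTH SPACE (Cf)", "\u200B"},
            {"U+E000 private use (Co)", "\uE000"},
            {"U+0041 LATIN CAPITAL LETTER A (Lu)", "A"},
        };
        for (String[] sample : samples) {
            System.out.println(sample[0]
                    + ": \\p{C}=" + isOther(sample[1])
                    + ", \\p{Cntrl}=" + isCntrl(sample[1]));
        }
    }
}
```

Only the ASCII control character is matched by both classes; the format and private-use characters are matched by \p{C} alone.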


It probably should support \p{Other}, but apparently it doesn't.

Worse, Java appears to violate RL1.7, which is required for Level 1 conformance and says that matching must happen by code point rather than by code unit:

To meet this requirement, an implementation shall handle the full range of Unicode code points, including values from U+FFFF to U+10FFFF. In particular, where UTF-16 is used, a sequence consisting of a leading surrogate followed by a trailing surrogate shall be handled as a single code point in matching.

There should be no matches for \p{C} in your test string, because your test string should be matched as a single emoji code point with General_Category=So (Other Symbol) instead of as two surrogates.
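The category difference between the code point and its two UTF-16 code units can be inspected directly with java.lang.Character, without involving the regex engine. This is only an illustrative sketch: the class name and the `categoryOf` helper are assumptions, and the helper maps just the three category constants needed here.

```java
public class CodePointCategories {
    // Maps the Character category constants used in this example
    // to their Unicode General_Category abbreviations.
    static String categoryOf(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.OTHER_SYMBOL: return "So";
            case Character.SURROGATE:    return "Cs";
            case Character.CONTROL:      return "Cc";
            default:                     return "other";
        }
    }

    public static void main(String[] args) {
        String poo = new String(Character.toChars(0x1F4A9));

        // One code point, stored as two UTF-16 code units.
        System.out.println("code units:  " + poo.length());
        System.out.println("code points: " + poo.codePointCount(0, poo.length()));

        // Read as a whole code point, the category is So.
        System.out.println("code point:     " + categoryOf(poo.codePointAt(0)));

        // Read one code unit at a time, each half is a surrogate (Cs), which is part of C.
        System.out.println("high surrogate: " + categoryOf(poo.charAt(0)));
        System.out.println("low surrogate:  " + categoryOf(poo.charAt(1)));
    }
}
```

A matcher that looks at the string one code unit at a time would therefore see characters in category C, while one that works by code point would see a single So character.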

answered Oct 05 '22 16:10 by user2357112 supports Monica