Character class for Unicode digits

Tags: java, regex

I need to create a Pattern that will match all Unicode digits and alphabetic characters. So far I have "\\p{IsAlphabetic}|[0-9]".

The first part is working well for me; it does a good job of identifying non-Latin characters as alphabetic. The problem is the second half, which obviously only matches the ASCII digits 0-9. By default, the character classes \\d and \\p{Digit} are also just [0-9]. The javadoc for Pattern does not seem to mention a character class for Unicode digits. Does anyone have a good solution for this problem?

For my purposes, I would accept a way to match the set of all characters for which Character.isDigit returns true.

Aurand asked Dec 26 '22 09:12


2 Answers

Quoting the Java docs about isDigit:

A character is a digit if its general category type, provided by getType(codePoint), is DECIMAL_DIGIT_NUMBER.

So, I believe the pattern to match digits should be \p{Nd}.

Here's a working example at ideone. As you can see, the results are consistent between Pattern.matches and Character.isDigit.
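
For readers without access to the linked example, here is a minimal sketch of the same kind of check. It is not the original ideone code: the class name NdDigitDemo, the method countMismatches and the combined pattern LETTER_OR_DIGIT are made up for illustration. It compares \p{Nd} against Character.isDigit for every code point, and also shows one way the question's \p{IsAlphabetic} could be combined with \p{Nd} in a single character class.

```java
import java.util.regex.Pattern;

class NdDigitDemo {
    // \p{Nd} is the Unicode general category DECIMAL_DIGIT_NUMBER.
    static final Pattern DIGIT = Pattern.compile("\\p{Nd}");

    // Illustrative combined pattern for the question (an assumption, not from the answer):
    // one or more characters that are either alphabetic or decimal digits.
    static final Pattern LETTER_OR_DIGIT = Pattern.compile("[\\p{IsAlphabetic}\\p{Nd}]+");

    // Counts the code points on which the regex and Character.isDigit disagree.
    static int countMismatches() {
        int mismatches = 0;
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            String s = new String(Character.toChars(cp));
            if (DIGIT.matcher(s).matches() != Character.isDigit(cp)) {
                mismatches++;
            }
        }
        return mismatches;
    }

    static boolean isLettersOrDigits(String s) {
        return LETTER_OR_DIGIT.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println("mismatches: " + countMismatches());
        // U+03B1 is Greek alpha, U+0E53 is Thai digit three, U+0969 is Devanagari digit three.
        System.out.println(isLettersOrDigits("abc\u03B1123\u0E53\u0969"));
        System.out.println(isLettersOrDigits("a-b")); // '-' is neither alphabetic nor a digit
    }
}
```

Putting both properties inside one bracketed class avoids the alternation used in the question and lets a single quantifier apply to the whole set.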

mgibsonbr answered Dec 28 '22 21:12


Use \d, but with the (?U) flag to enable the Unicode version of predefined character classes and POSIX character classes:

(?U)\d+

or in code:

System.out.println("3๓३".matches("(?U)\\d+")); // true

Using (?U) is equivalent to compiling the regex by calling Pattern.compile() with the UNICODE_CHARACTER_CLASS flag:

Pattern pattern = Pattern.compile("\\d", Pattern.UNICODE_CHARACTER_CLASS);
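
To make the equivalence concrete, here is a small sketch, assuming Java 7 or later (where UNICODE_CHARACTER_CLASS was introduced). The class and method names are invented for the example. It contrasts the default \d with the inline (?U) form and the compile flag, and adds \p{Alnum}, which the Pattern javadoc lists as [\p{IsAlphabetic}\p{IsDigit}] when the flag is set, so it may cover the question's "letters or digits" case in one class.

```java
import java.util.regex.Pattern;

class UnicodeFlagDemo {
    // Thai digit three (U+0E53) followed by Devanagari digit three (U+0969).
    static final String NON_ASCII_DIGITS = "\u0E53\u0969";

    // Default behaviour: \d is equivalent to [0-9].
    static boolean matchesDefault(String s) {
        return Pattern.compile("\\d+").matcher(s).matches();
    }

    // Inline flag: (?U) switches predefined and POSIX classes to their Unicode versions.
    static boolean matchesInlineFlag(String s) {
        return Pattern.compile("(?U)\\d+").matcher(s).matches();
    }

    // Same effect through the compile-time flag.
    static boolean matchesCompileFlag(String s) {
        return Pattern.compile("\\d+", Pattern.UNICODE_CHARACTER_CLASS).matcher(s).matches();
    }

    // With the flag set, \p{Alnum} covers Unicode alphabetic characters and digits.
    static boolean matchesUnicodeAlnum(String s) {
        return Pattern.compile("\\p{Alnum}+", Pattern.UNICODE_CHARACTER_CLASS).matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesDefault(NON_ASCII_DIGITS));
        System.out.println(matchesInlineFlag(NON_ASCII_DIGITS));
        System.out.println(matchesCompileFlag(NON_ASCII_DIGITS));
        // U+03B1 is Greek alpha.
        System.out.println(matchesUnicodeAlnum("\u03B1" + NON_ASCII_DIGITS));
    }
}
```

Note that the flag also changes \w, \s, \b and the other POSIX classes in the same pattern, so it is worth checking that the rest of the regex still behaves as intended.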

Bohemian answered Dec 28 '22 21:12