
Java Regular Expression with International Letters

Here's my current code:

return str.matches("^[A-Za-z\\-'. ]+");

I want it to include international letters. How do I do that in Java?

Thanks.

disco.dan.silver asked Jan 31 '13 22:01




1 Answer

It seems that what you want is to match all alphabetic characters. Typically you would do that with the POSIX character class \p{Alpha}, extended with the punctuation you also want to permit. Note that, as the Java regular expressions documentation says, it matches ASCII only by default.

However, what the documentation does not say clearly is that you can make this class work with Unicode characters. To do that, you need to turn on Unicode character class matching.
You can do this in one of two ways:

  1. By passing the UNICODE_CHARACTER_CLASS flag when creating the Pattern object:
    Pattern p = Pattern.compile("^[\\p{Alpha}\\-'. ]+", Pattern.UNICODE_CHARACTER_CLASS);
  2. By using the (?U) embedded flag inside the pattern:
    str.matches("^(?U)[\\p{Alpha}\\-'. ]+");
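To see what the flag actually changes, here is a minimal sketch that compiles the same character class twice, once without and once with UNICODE_CHARACTER_CLASS. The class name UnicodeAlphaDemo and the helper methods asciiOnly and unicodeAware are made up for this illustration.

```java
import java.util.regex.Pattern;

class UnicodeAlphaDemo {
    // Without the flag, \p{Alpha} is the ASCII-only POSIX class.
    static final Pattern ASCII_ONLY =
            Pattern.compile("^[\\p{Alpha}\\-'. ]+");

    // With the flag, \p{Alpha} follows the Unicode Alphabetic property.
    static final Pattern UNICODE =
            Pattern.compile("^[\\p{Alpha}\\-'. ]+", Pattern.UNICODE_CHARACTER_CLASS);

    // Hypothetical helpers; matches() requires the whole input to match.
    static boolean asciiOnly(String s) {
        return ASCII_ONLY.matcher(s).matches();
    }

    static boolean unicodeAware(String s) {
        return UNICODE.matcher(s).matches();
    }

    public static void main(String[] args) {
        String name = "Żółć";
        System.out.println(asciiOnly(name));    // false
        System.out.println(unicodeAware(name)); // true
    }
}
```

Compiling the Pattern once and reusing it avoids recompiling the expression on every call, which is what String.matches does internally.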

Proof of concept:

String[] test = {"Jean-Marie Le'Blanc", "Żółć", "Ὀδυσσεύς", "原田雅彦"};
for (String str : test) {
    System.out.print(str.matches("^(?U)[\\p{Alpha}\\-'. ]+") + " ");
}

As expected, the result is:

true true true true

Before you conclude that everything is correct, I have two additional points to make:

  • 原田雅彦 (Masahiko Harada) is composed of ideographic characters. Strictly speaking, they are not alphabetic characters, yet the pattern accepts them.
  • You want to match the dot (.) symbol. That's fine, but please consider matching the ideographic full stop (。) as well.
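The second point could be handled by adding U+3002 (ideographic full stop) to the character class. The following is only a sketch under that assumption. The class name NameWithIdeographicStop and the method isName are hypothetical, and other scripts may need further punctuation added in the same way.

```java
import java.util.regex.Pattern;

class NameWithIdeographicStop {
    // Assumption: besides the ASCII dot, U+3002 (ideographic full stop)
    // is the only extra punctuation mark we want to accept.
    static final Pattern NAME = Pattern.compile(
            "^[\\p{Alpha}\\-'. \\u3002]+", Pattern.UNICODE_CHARACTER_CLASS);

    // Hypothetical helper; matches() requires the whole input to match.
    static boolean isName(String s) {
        return NAME.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isName("原田雅彦。"));        // true
        System.out.println(isName("J. R. R. Tolkien")); // true
    }
}
```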
Paweł Dyda answered Sep 19 '22 21:09