
Java's regular expressions don't recognize characters from other languages as word characters (i.e., \w)

Let's say I have a word: "Aiavärav". The expression \w+ should capture this word, but the letter "ä" cuts the word in half. Instead of "Aiavärav", I get "Aiav". What is the correct regex for words that contain such non-ASCII letters?

jyriand asked Feb 09 '12 02:02



1 Answer

According to the documentation, \w only matches [a-zA-Z_0-9] unless you specify the UNICODE_CHARACTER_CLASS flag:

Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS)

or embed a (?U) in the pattern:

Pattern.compile("(?U)\\w+")

either of which requires JDK 1.7 (i.e., Java 7).
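To make the difference concrete, here is a minimal sketch that runs the default pattern and both Unicode-aware variants against the word from the question. The class name UnicodeWordDemo and the helper firstMatch are made up for illustration, and the ä is written as a Unicode escape so the result does not depend on the source file's encoding.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeWordDemo {
    // Hypothetical helper: returns the first match of the pattern, or null if there is none.
    static String firstMatch(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String word = "Aiav\u00e4rav"; // "Aiavärav", with ä written as a Unicode escape

        Pattern ascii = Pattern.compile("\\w+");
        Pattern viaFlag = Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);
        Pattern viaEmbedded = Pattern.compile("(?U)\\w+");

        // Default \w is ASCII-only, so the match stops just before the ä.
        System.out.println(firstMatch(ascii, word));       // Aiav
        // With the flag (Java 7+), ä counts as a word character.
        System.out.println(firstMatch(viaFlag, word).equals(word));     // true
        System.out.println(firstMatch(viaEmbedded, word).equals(word)); // true
    }
}
```

Note the capital U in the embedded form: lowercase (?u) is the separate UNICODE_CASE flag, which affects case-insensitive matching and does not change what \w means.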

If you don't have Java 7, you can generalize \w to Unicode by using \p{L} ("letter"; like [a-zA-Z], but not ASCII-specific) and \p{N} ("number"; like [0-9], but not ASCII-specific):

Pattern.compile("[\\p{L}_\\p{N}]+")
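As a rough illustration of that fallback, the sketch below collects every token the character class matches in a short sample string. The class name UnicodeTokenDemo, the method tokens, and the sample text are assumptions for the example only; it avoids Java 7 syntax on purpose.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeTokenDemo {
    // Letters, underscore, and numbers from any script: a Unicode-wide stand-in for \w.
    static final Pattern TOKEN = Pattern.compile("[\\p{L}_\\p{N}]+");

    // Hypothetical helper: returns every token found in the input, in order.
    static List<String> tokens(String input) {
        List<String> result = new ArrayList<String>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample text "Aiavärav_2 on 42 kõrge", with non-ASCII letters as Unicode escapes.
        // Yields four tokens: Aiavärav_2, on, 42, kõrge
        System.out.println(tokens("Aiav\u00e4rav_2 on 42 k\u00f5rge"));
    }
}
```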

But it sounds like maybe you're looking for actual words, in the normal sense (as opposed to the programming-language sense), and don't need to support digits and underscores? In that case, you can just use \p{L}:

Pattern.compile("\\p{L}+")
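For comparison, a small sketch of the letters-only pattern on a sample string that mixes words with digits and an underscore. WordsOnlyDemo, words, and the sample text are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordsOnlyDemo {
    // One or more letters from any script; digits and underscores end a match.
    static final Pattern WORD = Pattern.compile("\\p{L}+");

    // Hypothetical helper: returns every run of letters found in the input, in order.
    static List<String> words(String input) {
        List<String> result = new ArrayList<String>();
        Matcher m = WORD.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample text "Aiavärav_2 on 42 kõrge", with non-ASCII letters as Unicode escapes.
        // "_2" and "42" are skipped, leaving three words: Aiavärav, on, kõrge
        System.out.println(words("Aiav\u00e4rav_2 on 42 k\u00f5rge"));
    }
}
```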

(By the way, the curly brackets are actually optional: you can write \pL instead of \p{L} and \pN instead of \p{N}. People usually include them anyway, because they're required for multi-letter categories like \p{Lu}, "uppercase letter".)
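A quick way to check the shorthand is to run both spellings over the same text and compare the results. This is only a sketch: CategoryShorthandDemo, findAll, and the sample string are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CategoryShorthandDemo {
    // Hypothetical helper: returns every match of the regex in the input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> result = new ArrayList<String>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "\u00c4ri ja \u00d5nn"; // "Äri ja Õnn", written with Unicode escapes

        // The brace-less and braced spellings of a single-letter category match the same text.
        System.out.println(findAll("\\pL+", text).equals(findAll("\\p{L}+", text))); // true

        // A two-letter category such as Lu needs the braces; this finds the two capitals Ä and Õ.
        System.out.println(findAll("\\p{Lu}", text).size()); // 2
    }
}
```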

ruakh answered Oct 02 '22 16:10
