Java's regular expressions don't recognize characters from other languages as word characters (i.e. \w)

Question

Let's say I have a word: "Aiavärav". The expression \w+ should capture this word, but the letter "ä" cuts the word in half. Instead of "Aiavärav", I get "Aiav". What is the correct regex for words that contain such non-ASCII letters?

ruakh · Accepted Answer

According to the documentation, \w only matches [a-zA-Z_0-9] unless you specify the UNICODE_CHARACTER_CLASS flag:

Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS)

or embed a (?U) in the pattern:

Pattern.compile("(?U)\\w+")

Either approach requires JDK 1.7 (i.e., Java 7) or later.
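To make the difference concrete, here is a minimal sketch that runs the word from the question through all three variants. The class name UnicodeWordDemo and the helper firstMatch are made up for illustration; only Pattern, Matcher and the UNICODE_CHARACTER_CLASS flag come from the JDK. Note the doubled backslashes, which a Java string literal needs in order to pass a single backslash to the regex engine.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeWordDemo {
    // Default behaviour: \w is ASCII-only, i.e. [a-zA-Z_0-9].
    static final Pattern ASCII_WORD = Pattern.compile("\\w+");
    // Same pattern with the flag passed to compile() (Java 7+).
    static final Pattern FLAG_WORD = Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);
    // Same pattern with the flag embedded as (?U).
    static final Pattern EMBEDDED_WORD = Pattern.compile("(?U)\\w+");

    // Hypothetical helper: returns the first match in the input, or null if nothing matches.
    static String firstMatch(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The word from the question, with a-umlaut (U+00E4) escaped so the source stays ASCII.
        String word = "Aiav\u00e4rav";
        // Print lengths rather than the text itself to sidestep console-encoding issues.
        System.out.println(firstMatch(ASCII_WORD, word).length());    // 4 ("Aiav")
        System.out.println(firstMatch(FLAG_WORD, word).length());     // 8 (the whole word)
        System.out.println(firstMatch(EMBEDDED_WORD, word).length()); // 8 (the whole word)
    }
}
```

The flag and the embedded (?U) form behave the same here, so the choice mostly comes down to whether you control the call to Pattern.compile or only the pattern string.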

If you don't have Java 7, you can generalize \w to Unicode by using \p{L} ("letter"; like [a-zA-Z], but not ASCII-specific) and \p{N} ("number"; like [0-9], but not ASCII-specific):

Pattern.compile("[\\p{L}_\\p{N}]+")

But it sounds like maybe you're looking for actual words, in the normal sense (as opposed to the programming-language sense), and don't need to support digits and underscores? In that case, you can just use \p{L}:

Pattern.compile("\\p{L}+")
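As a rough sketch of how the two property-based patterns differ in practice, the following collects every match from a short sample string. The class name UnicodePropertyDemo, the helper findAll and the sample text are invented for this example; it sticks to pre-Java 7 syntax since that is the case these patterns are meant for.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodePropertyDemo {
    // Letters, digits and underscore: a Unicode-aware stand-in for \w.
    static final Pattern WORD_CHARS = Pattern.compile("[\\p{L}_\\p{N}]+");
    // Letters only: words in the everyday sense.
    static final Pattern LETTERS = Pattern.compile("\\p{L}+");

    // Hypothetical helper: collects every non-overlapping match, left to right.
    static List<String> findAll(Pattern pattern, String input) {
        List<String> matches = new ArrayList<String>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Sample text: the word from the question plus an identifier-like token.
        String text = "Aiav\u00e4rav v2_beta";
        // Two matches: the whole word, then "v2_beta".
        System.out.println(findAll(WORD_CHARS, text).size()); // 2
        // Three matches: the whole word, then "v" and "beta" (digit and underscore split them).
        System.out.println(findAll(LETTERS, text).size());    // 3
    }
}
```

Either way the non-ASCII word comes back in one piece; the patterns only disagree on tokens that contain digits or underscores.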

(By the way, the curly brackets are actually optional: you can write \pL instead of \p{L} and \pN instead of \p{N}. People usually include them anyway, because they're required for multi-letter categories like \p{Lu}, "uppercase letter".)
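A small sketch of the brace rule, assuming a made-up class ShortPropertyDemo and helper matchesWhole. It also shows what goes wrong if the braces are dropped from a two-letter category.

```java
import java.util.regex.Pattern;

class ShortPropertyDemo {
    // Hypothetical helper: true if the whole input matches the regex.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // The word from the question, with a-umlaut (U+00E4) escaped so the source stays ASCII.
        String word = "Aiav\u00e4rav";
        System.out.println(matchesWhole("\\p{L}+", word)); // true
        System.out.println(matchesWhole("\\pL+", word));   // true, same category without braces
        // Multi-letter categories need the braces: \p{Lu} is "uppercase letter".
        System.out.println(matchesWhole("\\p{Lu}\\p{L}*", word)); // true, the word starts with "A"
        // Without braces only one letter is read, so this is \pL followed by a literal "u".
        System.out.println(matchesWhole("\\pLu", "A"));  // false
        System.out.println(matchesWhole("\\pLu", "Au")); // true
    }
}
```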

Tags: java, regex, parsing

jyriand
