Let's say I have a word: "Aiavärav". The expression \w+ should capture this word, but the letter "ä" cuts the word in half. Instead of "Aiavärav", I get "Aiav". What is the correct regex for words that contain non-ASCII letters like this?
According to the documentation, \w matches only [a-zA-Z_0-9] unless you specify the UNICODE_CHARACTER_CLASS flag:
Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS)
or embed a (?U) flag in the pattern:
Pattern.compile("(?U)\\w+")
Either approach requires JDK 1.7 (i.e., Java 7) or later.
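As a quick check of the difference, here is a minimal sketch that runs the default pattern and both Unicode-aware variants against the word from the question. The class name and the firstMatch helper are made up for illustration; only Pattern, Matcher, and the UNICODE_CHARACTER_CLASS flag come from the JDK.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeWordFlagDemo {
    // Hypothetical helper: returns the first substring matched by the pattern, or null.
    static String firstMatch(Pattern pattern, String input) {
        Matcher matcher = pattern.matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // The word from the question; the a-umlaut is written as a Unicode escape
        // so the example does not depend on the source file encoding.
        String word = "Aiav\u00e4rav";

        Pattern asciiOnly = Pattern.compile("\\w+");
        Pattern withFlag = Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);
        Pattern withEmbeddedFlag = Pattern.compile("(?U)\\w+");

        // The default \w stops at the first non-ASCII letter.
        System.out.println(firstMatch(asciiOnly, word)); // Aiav

        // Both Unicode-aware variants match the whole word.
        System.out.println(firstMatch(withFlag, word).equals(word));         // true
        System.out.println(firstMatch(withEmbeddedFlag, word).equals(word)); // true
    }
}
```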
If you don't have Java 7, you can generalize \w to Unicode by using \p{L} ("letter"; like [a-zA-Z], but not ASCII-specific) and \p{N} ("number"; like [0-9], but not ASCII-specific):
Pattern.compile("[\\p{L}_\\p{N}]+")
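One way this character class might be used is to check whether a whole string is a single "word" token. The sketch below assumes that use case; the class name and isWordToken are invented for the example.

```java
import java.util.regex.Pattern;

class UnicodeTokenCheck {
    // Unicode letters, underscore, and Unicode numbers; no Java 7 flag needed.
    private static final Pattern WORD_TOKEN = Pattern.compile("[\\p{L}_\\p{N}]+");

    // Hypothetical helper: true if the entire string consists of "word" characters.
    static boolean isWordToken(String s) {
        return WORD_TOKEN.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isWordToken("Aiav\u00e4rav_2")); // true
        System.out.println(isWordToken("Aiav\u00e4rav!"));  // false: '!' is outside the class
        System.out.println(isWordToken(""));                // false: + needs at least one character
    }
}
```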
But it sounds like maybe you're looking for actual words, in the normal sense (as opposed to the programming-language sense), and don't need to support digits and underscores? In that case, you can just use \p{L}:
Pattern.compile("\\p{L}+")
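To show how the letters-only pattern treats digits and underscores as separators, here is a small sketch that pulls every word out of a string. The class name, extractWords, and the sample text are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LetterWordExtractor {
    // One or more Unicode letters; digits, underscores, and spaces end a word.
    private static final Pattern LETTERS = Pattern.compile("\\p{L}+");

    // Hypothetical helper: collects every run of letters found in the text.
    static List<String> extractWords(String text) {
        List<String> words = new ArrayList<String>();
        Matcher matcher = LETTERS.matcher(text);
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // Sample text: the word from the question followed by "_2" and two plain words.
        List<String> words = extractWords("Aiav\u00e4rav_2 on lahti");
        System.out.println(words.size());                          // 3
        System.out.println(words.get(0).equals("Aiav\u00e4rav"));  // true: "_2" is not part of the word
    }
}
```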
(By the way, the curly brackets are actually optional for single-letter categories: you can write \pL instead of \p{L} and \pN instead of \p{N}. People usually include them anyway, because they're required for multi-letter categories like \p{Lu}, "uppercase letter".)
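A short sketch of that equivalence, assuming a made-up sample string and a hypothetical findAll helper: the bracketless and bracketed forms should find the same matches, while \p{Lu} picks out only the uppercase letters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ShortCategoryDemo {
    // Hypothetical helper: returns every match of the regex in the input.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<String>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Sample text with two non-ASCII uppercase letters, written as Unicode escapes.
        String text = "\u00c4ri ja \u00d5un";

        // \pL and \p{L} are the same category, so the match lists are equal.
        System.out.println(findAll("\\pL+", text).equals(findAll("\\p{L}+", text))); // true

        // \p{Lu} needs the brackets and matches only the two uppercase letters.
        System.out.println(findAll("\\p{Lu}", text).size()); // 2
    }
}
```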