In Java (v11) I would like to allow letters from any language when choosing a username: ASCII, Latin, Greek, Chinese and so on.
We tried the pattern \p{IsAlphabetic}.
But with this pattern, names like "𝕮𝖍𝖗𝖎𝖘" are allowed. I don't want to let people style their names with such Unicode characters. I want the user to enter "Chris" and not "𝕮𝖍𝖗𝖎𝖘".
It should be allowed to name yourself "尤雨溪", "Linus" or "Gödel".
How can I write a regex that rejects such styled names?
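To reproduce the behavior described in the question, here is a minimal sketch. The class and method names are made up for illustration; the styled name is spelled with the same escapes used elsewhere in this thread.

```java
class AlphabeticDemo {
    // Hypothetical helper: true if every code point has the Unicode Alphabetic property.
    static boolean isAlphabetic(String name) {
        return name.matches("\\p{IsAlphabetic}+");
    }

    public static void main(String[] args) {
        // "Chris" written with Mathematical Bold Fraktur letters (U+1D56E and friends).
        String styled = "\uD835\uDD6E\uD835\uDD8D\uD835\uDD97\uD835\uDD8E\uD835\uDD98";
        System.out.println(isAlphabetic("Chris")); // true
        System.out.println(isAlphabetic(styled));  // true, which is the problem
        System.out.println(isAlphabetic("Chris1")); // false, digits are not alphabetic
    }
}
```

The styled letters are ordinary uppercase and lowercase letters as far as Unicode is concerned, so a check based only on "is it a letter" cannot tell them apart from plain ones.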
The plus sign + is a greedy quantifier meaning "one or more times". For example, the expression X+ matches one or more X characters. Likewise, the regular expression \s matches a single whitespace character, while \s+ matches one or more whitespace characters.
You can use this regex /^[ A-Za-z0-9_@./#&+-]*$/.
Backslashes in Java. The backslash \ is an escape character in Java string literals, so it already has a predefined meaning there. To put a single backslash into a regex you have to write a double backslash \\ in the source. If you want to use \w in your regex, you must write \\w in the Java string.
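A small sketch of the escaping rule, assuming nothing beyond java.util.regex. The class name and the collapseWhitespace helper are hypothetical and only serve the illustration.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // The Java literal "\\s+" is the three-character regex \s+ (one or more whitespace characters).
    static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    // Hypothetical helper: replaces each run of whitespace with a single space.
    static String collapseWhitespace(String input) {
        return WHITESPACE_RUN.matcher(input).replaceAll(" ");
    }

    public static void main(String[] args) {
        System.out.println("\\w".length());                // 2: a backslash and the letter w
        System.out.println(collapseWhitespace("a  \t b")); // a b
        System.out.println("user_1".matches("\\w+"));      // true
    }
}
```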
Here is a regular expression that allows names written entirely in one of the Latin, Han (Chinese), Greek or Cyrillic scripts. It can be extended with more Unicode scripts.
^(\p{sc=Han}+|\p{sc=Latin}+|\p{sc=Greek}+|\p{sc=Cyrillic}+)$
Demo here: https://regex101.com/r/yCt5xT/1
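For Java, the same idea might look like the sketch below, with a + quantifier on every alternative and the backslashes doubled. The class and method names are my own, not from any library. Note that this accepts a name only if it stays within a single script.

```java
import java.util.regex.Pattern;

class SingleScriptName {
    // One alternative per allowed script; add more scripts as needed.
    static final Pattern SINGLE_SCRIPT = Pattern.compile(
            "^(\\p{sc=Han}+|\\p{sc=Latin}+|\\p{sc=Greek}+|\\p{sc=Cyrillic}+)$");

    // Hypothetical helper: true if the whole name is written in one allowed script.
    static boolean isValid(String name) {
        return SINGLE_SCRIPT.matcher(name).matches();
    }

    public static void main(String[] args) {
        String han = "\u5C24\u96E8\u6EAA"; // the Chinese name from the question
        String styled = "\uD835\uDD6E\uD835\uDD8D\uD835\uDD97\uD835\uDD8E\uD835\uDD98";
        System.out.println(isValid(han));            // true
        System.out.println(isValid("Linus"));        // true
        System.out.println(isValid("G\u00F6del"));   // true (precomposed o with diaeresis)
        System.out.println(isValid(styled));         // false
        System.out.println(isValid("Linus" + han));  // false, mixes two scripts
    }
}
```

The styled letters are rejected because the Mathematical Alphanumeric Symbols belong to the Common script, not to Latin.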
Here is a list of Unicode scripts that can be used, taken from https://www.regular-expressions.info/unicode.html (in Java, write them in the form \p{sc=Han} or \p{IsHan}):
\p{Common}
\p{Arabic}
\p{Armenian}
\p{Bengali}
\p{Bopomofo}
\p{Braille}
\p{Buhid}
\p{Canadian_Aboriginal}
\p{Cherokee}
\p{Cyrillic}
\p{Devanagari}
\p{Ethiopic}
\p{Georgian}
\p{Greek}
\p{Gujarati}
\p{Gurmukhi}
\p{Han}
\p{Hangul}
\p{Hanunoo}
\p{Hebrew}
\p{Hiragana}
\p{Inherited}
\p{Kannada}
\p{Katakana}
\p{Khmer}
\p{Lao}
\p{Latin}
\p{Limbu}
\p{Malayalam}
\p{Mongolian}
\p{Myanmar}
\p{Ogham}
\p{Oriya}
\p{Runic}
\p{Sinhala}
\p{Syriac}
\p{Tagalog}
\p{Tagbanwa}
\p{TaiLe}
\p{Tamil}
\p{Telugu}
\p{Thaana}
\p{Thai}
\p{Tibetan}
\p{Yi}
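If you are unsure which script a given character belongs to, Java can tell you via Character.UnicodeScript.of. The sketch below uses a hypothetical scriptOf helper that looks only at the first code point of a string.

```java
class ScriptLookup {
    // Hypothetical helper: script of the first code point in the string.
    static Character.UnicodeScript scriptOf(String s) {
        return Character.UnicodeScript.of(s.codePointAt(0));
    }

    public static void main(String[] args) {
        // First letter of the styled "Chris" (U+1D56E), written as a surrogate pair.
        String styledC = "\uD835\uDD6E";
        System.out.println(scriptOf("L"));       // LATIN
        System.out.println(scriptOf("\u5C24"));  // HAN
        System.out.println(scriptOf("\u03A9"));  // GREEK
        System.out.println(scriptOf(styledC));   // COMMON
    }
}
```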
The challenge is that "𝕮𝖍𝖗𝖎𝖘" is composed of surrogate pairs, which the regex engine interprets as code points, not chars.
The solution is to match any letter using \p{L}, but exclude code points of high surrogates on up:
"[\\p{L}&&[^\\x{0d000}-\\x{10ffff}]]+"
Trying to exclude the unicode characters
"[\\p{L}&&[^\ud000-\uffff]]+" // doesn't work
doesn't work, because the surrogate pairs are merged into a single code point.
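To make the char versus code point distinction concrete, here is a small sketch (class and helper names are hypothetical). It shows that the styled name is 10 chars but only 5 code points, and that a range limited to the BMP does not exclude it.

```java
class SurrogateDemo {
    static final String STYLED =
            "\uD835\uDD6E\uD835\uDD8D\uD835\uDD97\uD835\uDD8E\uD835\uDD98";

    // Hypothetical helper: number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // The BMP-only exclusion from above, written with regex \\u escapes.
    static boolean bmpRangeCheck(String s) {
        return s.matches("[\\p{L}&&[^\\ud000-\\uffff]]+");
    }

    public static void main(String[] args) {
        System.out.println(STYLED.length());                              // 10
        System.out.println(codePoints(STYLED));                           // 5
        System.out.println(Character.isHighSurrogate(STYLED.charAt(0)));  // true
        System.out.println(Integer.toHexString(STYLED.codePointAt(0)));   // 1d56e
        System.out.println(bmpRangeCheck(STYLED));                        // true, so nothing was excluded
    }
}
```

Each letter is seen by the matcher as one code point above U+FFFF, so it never falls inside a range that ends at \uffff.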
Test code:
String[] names = {"尤雨溪", "Linus", "Gödel", "\uD835\uDD6E\uD835\uDD8D\uD835\uDD97\uD835\uDD8E\uD835\uDD98"};
for (String name : names) {
    System.out.println(name + ": " + name.matches("[\\p{L}&&[^\\x{0d000}-\\x{10ffff}]]+"));
}
Output:
尤雨溪: true
Linus: true
Gödel: true
𝕮𝖍𝖗𝖎𝖘: false
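One caveat worth checking before adopting the range as is: it starts at U+D000, while the surrogates only start at U+D800, and the Hangul syllables U+D000 to U+D7A3 sit in between. The sketch below assumes you want to keep those syllables and moves the lower bound to U+D800. The class name, the constants and the Korean sample name are my own additions for illustration.

```java
class HangulSafeName {
    // Range as posted in this thread: excludes U+D000 and everything above.
    static final String POSTED = "[\\p{L}&&[^\\x{0d000}-\\x{10ffff}]]+";
    // Assumed adjustment: start excluding at the first surrogate, U+D800.
    static final String ADJUSTED = "[\\p{L}&&[^\\x{d800}-\\x{10ffff}]]+";

    public static void main(String[] args) {
        // Three Hangul syllables; the last one is U+D55C, above U+D000.
        String korean = "\uAE40\uC9C0\uD55C";
        String styled = "\uD835\uDD6E\uD835\uDD8D\uD835\uDD97\uD835\uDD8E\uD835\uDD98";
        System.out.println(korean.matches(POSTED));    // false
        System.out.println(korean.matches(ADJUSTED));  // true
        System.out.println(styled.matches(ADJUSTED));  // false
        System.out.println("Linus".matches(ADJUSTED)); // true
    }
}
```

Either way, the range still rejects every letter in the supplementary planes, which includes some rare Han characters, so it is a trade-off rather than a complete answer.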