
Java Regex - Allow all regular Unicode characters for names but not obscure variants

Tags: java, regex

In Java (v11) I would like to allow letters from any language when choosing a username: ASCII, Latin, Greek, Chinese and so on.

We tried the pattern \p{IsAlphabetic}.

But this pattern also allows names like "𝕮𝖍𝖗𝖎𝖘". I don't want to let people style their names with such Unicode characters. A user should enter "Chris" and not "𝕮𝖍𝖗𝖎𝖘".

It should be allowed to name yourself "尤雨溪", "Linus" or "Gödel".

How can I write a regex that rejects such styled characters in names?
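
A minimal sketch of why \p{IsAlphabetic} accepts the styled name, assuming a Java 11 runtime: each styled character is a single code point in the Mathematical Alphanumeric Symbols block, which Unicode classifies as a letter. The class and method names are illustrative and not part of the question.

```java
import java.lang.Character.UnicodeScript;

class StyledLetterDemo {
    // The styled "Chris" from the question, spelled with UTF-16 escapes
    static final String STYLED =
            "\uD835\uDD6E\uD835\uDD8D\uD835\uDD97\uD835\uDD8E\uD835\uDD98";

    /** Summarizes one code point: Unicode name, Alphabetic property, script. */
    static String describe(int codePoint) {
        return String.format("U+%X %s alphabetic=%b script=%s",
                codePoint,
                Character.getName(codePoint),
                Character.isAlphabetic(codePoint),
                UnicodeScript.of(codePoint));
    }

    public static void main(String[] args) {
        // The pattern from the question accepts the styled name
        System.out.println(STYLED.matches("\\p{IsAlphabetic}+"));
        // Print what Unicode reports for each code point
        STYLED.codePoints().forEach(cp -> System.out.println(describe(cp)));
    }
}
```

The first line printed is true, and every code point is reported with alphabetic=true and script=COMMON, so the styled letters are real letters that do not belong to the Latin script.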

Janning asked Feb 23 '20 15:02


2 Answers

Here is a regular expression that allows names written entirely in one of these scripts: Latin, Han (Chinese), Greek, or Cyrillic. It can be extended with more Unicode scripts.

^(\p{sc=Han}+|\p{sc=Latin}+|\p{sc=Greek}+|\p{sc=Cyrillic}+)$

Demo here: https://regex101.com/r/yCt5xT/1

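Here is a list of Unicode scripts that can be used, taken from https://www.regular-expressions.info/unicode.html. In Java, write them with the Is prefix or the sc= keyword, for example \p{IsLatin} or \p{sc=Latin}; java.util.regex does not accept the bare \p{Latin} form shown here.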

\p{Common}
\p{Arabic}
\p{Armenian}
\p{Bengali}
\p{Bopomofo}
\p{Braille}
\p{Buhid}
\p{Canadian_Aboriginal}
\p{Cherokee}
\p{Cyrillic}
\p{Devanagari}
\p{Ethiopic}
\p{Georgian}
\p{Greek}
\p{Gujarati}
\p{Gurmukhi}
\p{Han}
\p{Hangul}
\p{Hanunoo}
\p{Hebrew}
\p{Hiragana}
\p{Inherited}
\p{Kannada}
\p{Katakana}
\p{Khmer}
\p{Lao}
\p{Latin}
\p{Limbu}
\p{Malayalam}
\p{Mongolian}
\p{Myanmar}
\p{Ogham}
\p{Oriya}
\p{Runic}
\p{Sinhala}
\p{Syriac}
\p{Tagalog}
\p{Tagbanwa}
\p{TaiLe}
\p{Tamil}
\p{Telugu}
\p{Thaana}
\p{Thai}
\p{Tibetan}
\p{Yi} 
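
As a rough sketch of how the script-based pattern could be used from Java: the class name ScriptNameCheck and the helper isAllowed are hypothetical, and the set of four scripts is only the example set from this answer.

```java
import java.util.regex.Pattern;

class ScriptNameCheck {
    // The whole name must be written in exactly one of these scripts.
    // Java spells script properties as \p{IsLatin} or \p{sc=Latin}.
    private static final Pattern SINGLE_SCRIPT = Pattern.compile(
            "\\p{IsHan}+|\\p{IsLatin}+|\\p{IsGreek}+|\\p{IsCyrillic}+");

    /** Hypothetical helper: true if the name passes the script check. */
    static boolean isAllowed(String name) {
        // matches() requires the entire input to match, so no ^ or $ is needed
        return SINGLE_SCRIPT.matcher(name).matches();
    }

    public static void main(String[] args) {
        String[] names = {
            "\u5C24\u96E8\u6EAA",  // Han name from the question
            "Linus",
            "G\u00F6del",          // precomposed o with diaeresis (U+00F6)
            "\uD835\uDD6E\uD835\uDD8D\uD835\uDD97\uD835\uDD8E\uD835\uDD98" // styled "Chris"
        };
        for (String name : names) {
            System.out.println(name + ": " + isAllowed(name));
        }
    }
}
```

The first three names are accepted and the styled one is rejected, because the styled letters have script Common. Note that a decomposed spelling of "Gödel" (o followed by the combining diaeresis U+0308) would be rejected, since combining marks have script Inherited; normalizing input to NFC first is one way to avoid that.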
jordiburgos answered Sep 28 '22 12:09


The challenge is that "𝕮𝖍𝖗𝖎𝖘" is stored as surrogate pairs, and the regex engine matches against whole code points, not individual chars.

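The solution is to match any letter using \p{L} but exclude every code point from U+D000 upward, which covers the surrogate range and all supplementary planes (it also excludes some ordinary letters, such as the Hangul syllables at U+D000 to U+D7A3):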

"[\\p{L}&&[^\\x{0d000}-\\x{10ffff}]]+"

Trying to exclude the surrogates as a range of chars

"[\\p{L}&&[^\ud000-\uffff]]+" // doesn't work

doesn't work, because each surrogate pair is merged into a single code point above U+FFFF before the character class is tested.
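
The difference between chars and code points can be checked directly. This is a small sketch with a hypothetical class name; the two patterns are the ones discussed above, with the char range written as regex-level \u escapes.

```java
class SurrogateDemo {
    // Styled "Chris": five code points, each stored as a surrogate pair
    static final String STYLED =
            "\uD835\uDD6E\uD835\uDD8D\uD835\uDD97\uD835\uDD8E\uD835\uDD98";

    // Range written as chars, so it stops at U+FFFF
    static final String CHAR_RANGE = "[\\p{L}&&[^\\ud000-\\uffff]]+";
    // Range written as code points, reaching the supplementary planes
    static final String CODE_POINT_RANGE = "[\\p{L}&&[^\\x{d000}-\\x{10ffff}]]+";

    public static void main(String[] args) {
        System.out.println(STYLED.length());                            // UTF-16 chars
        System.out.println(STYLED.codePointCount(0, STYLED.length()));  // code points
        System.out.println(Integer.toHexString(STYLED.codePointAt(0))); // first code point
        System.out.println(STYLED.matches(CHAR_RANGE));       // styled name slips through
        System.out.println(STYLED.matches(CODE_POINT_RANGE)); // styled name is rejected
    }
}
```

This prints 10, 5, 1d56e, true and false: the string has ten chars but five code points, and only the code point range sees values above U+FFFF.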


Test code:

// The last entry is "𝕮𝖍𝖗𝖎𝖘" written as UTF-16 surrogate pairs
String[] names = {"尤雨溪", "Linus", "Gödel", "\uD835\uDD6E\uD835\uDD8D\uD835\uDD97\uD835\uDD8E\uD835\uDD98"};

for (String name : names) {
    System.out.println(name + ": " + name.matches("[\\p{L}&&[^\\x{0d000}-\\x{10ffff}]]+"));
}

Output:

尤雨溪: true
Linus: true
Gödel: true
𝕮𝖍𝖗𝖎𝖘: false
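
One side effect of the U+D000 cutoff is worth checking: ordinary letters above it, such as many Hangul syllables, are rejected too. The sketch below compares the pattern above with a variant that excludes letters whose script is Common. That variant is an assumption offered for comparison, not part of this answer, and the class and constant names are hypothetical.

```java
import java.util.regex.Pattern;

class RangeVersusScript {
    // Pattern from the answer: letters below U+D000 only
    static final Pattern BY_RANGE =
            Pattern.compile("[\\p{L}&&[^\\x{0d000}-\\x{10ffff}]]+");
    // Assumed alternative: any letter whose script is not Common
    static final Pattern BY_SCRIPT =
            Pattern.compile("[\\p{L}&&[^\\p{IsCommon}]]+");

    public static void main(String[] args) {
        String[] names = {
            "Linus",
            "\uD55C\uAE00",  // Hangul syllables U+D55C and U+AE00
            "\uD835\uDD6E\uD835\uDD8D\uD835\uDD97\uD835\uDD8E\uD835\uDD98" // styled "Chris"
        };
        for (String name : names) {
            System.out.println(name
                    + ": range=" + BY_RANGE.matcher(name).matches()
                    + " script=" + BY_SCRIPT.matcher(name).matches());
        }
    }
}
```

Both patterns accept "Linus" and reject the styled name, but only the script-based variant accepts the Hangul name, because U+D55C lies inside the excluded range.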
Bohemian answered Sep 28 '22 12:09