
Java Regex - Allow all regular Unicode characters for names but not obscure variants




In Java (v11), I would like to allow characters from any language when choosing a username: ASCII, Latin, Greek, Chinese and so on.

We tried the pattern \p{IsAlphabetic}.

But with this pattern, names like "𝕮𝖍𝖗𝖎𝖘" are allowed. I don't want to let people style their names with such Unicode characters. I want the user to enter "Chris" and not "𝕮𝖍𝖗𝖎𝖘".

Names such as "尤雨溪", "Linus" or "Gödel" should be allowed.

How can I write a regex that rejects such styled variants in names?

Janning asked Feb 23 '20 15:02



2 Answers

Here is a regular expression that allows Latin, Han (Chinese), Greek and Cyrillic (Russian) letters. It can be extended with more Unicode scripts.


Demo here: https://regex101.com/r/yCt5xT/1

Here is the full list of Unicode Scripts that can be used: https://www.regular-expressions.info/unicode.html
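A minimal sketch of such a script whitelist, assuming Java's `\p{IsLatin}`-style script properties. The class name `ScriptWhitelist`, the method `isValid` and the exact pattern are hypothetical and may differ from the pattern behind the demo link. The intersection with `\p{L}` is an added assumption that keeps non-letter members of those scripts out. The styled name is rejected because mathematical alphanumeric symbols belong to the script Common, not Latin.

```java
import java.util.regex.Pattern;

public class ScriptWhitelist {
    // Hypothetical pattern: letters only, restricted to four Unicode scripts.
    // Add further \p{Is...} script properties to allow more languages.
    public static final Pattern NAME = Pattern.compile(
            "[\\p{L}&&[\\p{IsLatin}\\p{IsHan}\\p{IsGreek}\\p{IsCyrillic}]]+");

    public static boolean isValid(String name) {
        return NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        // Non-ASCII names are written as escapes so the file compiles under any source encoding.
        String han = "\u5C24\u96E8\u6EAA";   // the Chinese name from the question
        String godel = "G\u00F6del";         // precomposed o with diaeresis
        String styled = "\uD835\uDD6E\uD835\uDD8D\uD835\uDD97\uD835\uDD8E\uD835\uDD98"; // fraktur "Chris"

        System.out.println(isValid(han));     // true
        System.out.println(isValid("Linus")); // true
        System.out.println(isValid(godel));   // true
        System.out.println(isValid(styled));  // false
    }
}
```

A name typed with a separate combining mark (for example "o" followed by U+0308) would be rejected by this sketch, since combining marks are not letters; normalizing input to NFC first is one way to handle that.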

jordiburgos answered Sep 28 '22 12:09


The challenge is that "𝕮𝖍𝖗𝖎𝖘" is composed of surrogate pairs, which the regex engine interprets as code points, not chars.

The solution is to match any letter using \p{L}, but exclude code points from U+D000 (just below the high-surrogate range) on up:

"[\\p{L}&&[^\\x{0d000}-\\x{10ffff}]]+"

Trying to exclude the characters by their UTF-16 char values instead

"[\\p{L}&&[^\ud000-\uffff]]+" // doesn't work

doesn't work, because each surrogate pair is merged into a single code point (here above U+FFFF), which falls outside the excluded range.
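The difference between chars and code points can be seen in a small sketch. The class `SurrogateDemo` and its method names are hypothetical, and the char-range pattern is written here with regex-level `\\u` escapes, which is assumed to be equivalent to the literal form above.

```java
public class SurrogateDemo {
    // Mathematical bold fraktur capital C (U+1D56E), written as its UTF-16 surrogate pair.
    public static final String FRAKTUR_C = "\uD835\uDD6E";

    // Excludes only up to U+FFFF, so supplementary code points slip through.
    public static boolean matchesCharRange(String s) {
        return s.matches("[\\p{L}&&[^\\ud000-\\uffff]]+");
    }

    // Excludes everything from U+D000 up to U+10FFFF, including supplementary code points.
    public static boolean matchesCodePointRange(String s) {
        return s.matches("[\\p{L}&&[^\\x{0d000}-\\x{10ffff}]]+");
    }

    public static void main(String[] args) {
        System.out.println(FRAKTUR_C.length());                               // 2 chars
        System.out.println(FRAKTUR_C.codePointCount(0, FRAKTUR_C.length()));  // 1 code point
        System.out.println(Integer.toHexString(FRAKTUR_C.codePointAt(0)));    // 1d56e
        System.out.println(matchesCharRange(FRAKTUR_C));                      // true: not excluded
        System.out.println(matchesCodePointRange(FRAKTUR_C));                 // false: excluded
    }
}
```

Because the excluded range begins at U+D000, it also rejects some legitimate letters, such as Hangul syllables from U+D000 to U+D7A3 and ideographs in the supplementary planes.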

Test code:

String[] names = {"尤雨溪", "Linus", "Gödel", "\uD835\uDD6E\uD835\uDD8D\uD835\uDD97\uD835\uDD8E\uD835\uDD98"};

for (String name : names) {
    System.out.println(name + ": " + name.matches("[\\p{L}&&[^\\x{0d000}-\\x{10ffff}]]+"));
}

Output:

尤雨溪: true
Linus: true
Gödel: true
𝕮𝖍𝖗𝖎𝖘: false
Bohemian answered Sep 28 '22 12:09
