In Java (v11) I would like to allow characters from any language in a username: ASCII, Latin, Greek, Chinese and so on. We tried the pattern <code>\p{IsAlphabetic}</code>, but it also allows names like "𝕮𝖍𝖗𝖎𝖘". I don't want to let people style their names with such Unicode characters; the user should enter "Chris" and not "𝕮𝖍𝖗𝖎𝖘". Names like "尤雨溪", "Linus" or "Gödel" should be allowed. How can I write a regex that rejects such styled variants?

Java Regex - Allow all regular Unicode characters for names but not obscure variants

2 Answers

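Here is a regular expression that allows names written entirely in one of these scripts: Latin, Han (Chinese), Greek, or Cyrillic. Because each alternative covers a single script, a name that mixes scripts is rejected. The pattern can be extended with more Unicode scripts.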

^(\p{sc=Han}+|\p{sc=Latin}+|\p{sc=Greek}+|\p{sc=Cyrillic}+)$

Demo here: https://regex101.com/r/yCt5xT/1
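As a rough sketch of how this pattern might be used from Java (the class and method names are illustrative, not part of the answer): the backslashes are doubled inside the string literal, and every alternative carries a `+` so that multi-letter names match in each script.

```java
import java.util.regex.Pattern;

final class ScriptUsernames {
    // Each alternative requires the whole name to come from one script.
    private static final Pattern SINGLE_SCRIPT = Pattern.compile(
            "^(\\p{sc=Han}+|\\p{sc=Latin}+|\\p{sc=Greek}+|\\p{sc=Cyrillic}+)$");

    // Hypothetical helper: true if the name uses exactly one allowed script.
    static boolean isAllowed(String name) {
        return SINGLE_SCRIPT.matcher(name).matches();
    }

    public static void main(String[] args) {
        // The styled "Chris" from the question, written as surrogate pairs.
        String styled = "\uD835\uDD6E\uD835\uDD8D\uD835\uDD97\uD835\uDD8E\uD835\uDD98";
        String[] names = {"尤雨溪", "Linus", "G\u00F6del", "Лев", "Linus\u03BB", styled};
        for (String name : names) {
            System.out.println(name + ": " + isAllowed(name));
        }
    }
}
```

The styled letters belong to the Common script rather than Latin, so no alternative accepts them. The mixed Latin and Greek name is rejected as well, which may or may not be what you want for usernames.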

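Here is a list of Unicode scripts that can be used, taken from https://www.regular-expressions.info/unicode.html. The list uses that page's bare notation; Java's Pattern documents script names with the Is prefix or the script/sc keyword, for example \p{IsHan} or \p{sc=Han}.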

\p{Common}
\p{Arabic}
\p{Armenian}
\p{Bengali}
\p{Bopomofo}
\p{Braille}
\p{Buhid}
\p{Canadian_Aboriginal}
\p{Cherokee}
\p{Cyrillic}
\p{Devanagari}
\p{Ethiopic}
\p{Georgian}
\p{Greek}
\p{Gujarati}
\p{Gurmukhi}
\p{Han}
\p{Hangul}
\p{Hanunoo}
\p{Hebrew}
\p{Hiragana}
\p{Inherited}
\p{Kannada}
\p{Katakana}
\p{Khmer}
\p{Lao}
\p{Latin}
\p{Limbu}
\p{Malayalam}
\p{Mongolian}
\p{Myanmar}
\p{Ogham}
\p{Oriya}
\p{Runic}
\p{Sinhala}
\p{Syriac}
\p{Tagalog}
\p{Tagbanwa}
\p{TaiLe}
\p{Tamil}
\p{Telugu}
\p{Thaana}
\p{Thai}
\p{Tibetan}
\p{Yi}

answered Sep 28 '22 12:09

jordiburgos

The challenge is that "𝕮𝖍𝖗𝖎𝖘" is composed of surrogate pairs, which the regex engine interprets as code points, not chars.
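A small sketch of what that means in practice, using only standard `String` methods (the class and helper names are made up for illustration): each styled letter occupies two Java chars but counts as one code point, and the regex engine matches it as a single unit.

```java
final class SurrogateDemo {
    // Number of Unicode code points in the string (not UTF-16 chars).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Hex value of the first code point, e.g. "1D56E".
    static String firstCodePointHex(String s) {
        return Integer.toHexString(s.codePointAt(0)).toUpperCase();
    }

    public static void main(String[] args) {
        String styledC = "\uD835\uDD6E"; // first letter of the styled name
        System.out.println("chars: " + styledC.length());                 // 2
        System.out.println("code points: " + codePoints(styledC));        // 1
        System.out.println("value: U+" + firstCodePointHex(styledC));     // U+1D56E
        System.out.println("is letter: " + Character.isLetter(styledC.codePointAt(0)));
        // A single "." consumes the whole pair, because matching is per code point.
        System.out.println("matches '.': " + styledC.matches("."));
    }
}
```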

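The solution is to match any letter using \p{L}, but exclude every code point from U+D000 upward, a range that covers the surrogates and all supplementary planes. Be aware that this range also contains the Hangul syllables U+D000 to U+D7A3, so some Korean names are rejected too: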

"[\\p{L}&&[^\\x{0d000}-\\x{10ffff}]]+"

Trying to exclude the range with char (UTF-16 code unit) escapes instead

"[\\p{L}&&[^\ud000-\uffff]]+" // doesn't work

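doesn't work, because the regex engine merges each surrogate pair into a single code point before matching, so every styled letter is a code point above U+FFFF and falls outside the excluded range.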
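To make the difference concrete, here is a minimal side-by-side sketch of the two character classes. The class and method names are hypothetical, and the char-range pattern is written with regex-level `\\u` escapes, which should be equivalent to the source-level escapes used above.

```java
import java.util.regex.Pattern;

final class ClassComparison {
    // Excludes only U+D000..U+FFFF; supplementary code points slip through.
    private static final Pattern CHAR_RANGE =
            Pattern.compile("[\\p{L}&&[^\\ud000-\\uffff]]+");
    // Excludes U+D000..U+10FFFF, including the supplementary planes.
    private static final Pattern CODE_POINT_RANGE =
            Pattern.compile("[\\p{L}&&[^\\x{0d000}-\\x{10ffff}]]+");

    static boolean matchesCharRange(String name) {
        return CHAR_RANGE.matcher(name).matches();
    }

    static boolean matchesCodePointRange(String name) {
        return CODE_POINT_RANGE.matcher(name).matches();
    }

    public static void main(String[] args) {
        String styled = "\uD835\uDD6E\uD835\uDD8D\uD835\uDD97\uD835\uDD8E\uD835\uDD98";
        System.out.println("char range accepts styled name: " + matchesCharRange(styled));
        System.out.println("code point range accepts styled name: " + matchesCodePointRange(styled));
        System.out.println("both accept Linus: "
                + (matchesCharRange("Linus") && matchesCodePointRange("Linus")));
    }
}
```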

Test code:

String[] names = {"尤雨溪", "Linus", "Gödel", "\uD835\uDD6E\uD835\uDD8D\uD835\uDD97\uD835\uDD8E\uD835\uDD98"};

for (String name : names) {
    System.out.println(name + ": " + name.matches("[\\p{L}&&[^\\x{0d000}-\\x{10ffff}]]+"));
}

Output:

尤雨溪: true
Linus: true
Gödel: true
𝕮𝖍𝖗𝖎𝖘: false
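It may be worth probing where the excluded range starts, since U+D000 sits inside the Hangul syllables block. The sketch below compares the pattern above with a narrower variant that excludes only the supplementary planes. That variant is an assumption for illustration, not something the answer proposes, and it has its own trade-off: it lets fullwidth Latin letters through.

```java
import java.util.regex.Pattern;

final class RangeBoundaryCheck {
    // Pattern from the answer: letters below U+D000 only.
    private static final Pattern BELOW_D000 =
            Pattern.compile("[\\p{L}&&[^\\x{0d000}-\\x{10ffff}]]+");
    // Hypothetical variant: any letter in the Basic Multilingual Plane.
    private static final Pattern BMP_ONLY =
            Pattern.compile("[\\p{L}&&[^\\x{10000}-\\x{10ffff}]]+");

    static boolean belowD000(String name) {
        return BELOW_D000.matcher(name).matches();
    }

    static boolean bmpOnly(String name) {
        return BMP_ONLY.matcher(name).matches();
    }

    public static void main(String[] args) {
        String hangul = "\uD55C\uAE00";                        // U+D55C is above U+D000
        String fullwidth = "\uFF23\uFF48\uFF52\uFF49\uFF53";   // fullwidth "Chris"
        String styled = "\uD835\uDD6E\uD835\uDD8D\uD835\uDD97\uD835\uDD8E\uD835\uDD98";
        for (String name : new String[] {hangul, fullwidth, styled}) {
            System.out.println(name + ": belowD000=" + belowD000(name)
                    + " bmpOnly=" + bmpOnly(name));
        }
    }
}
```

Neither range is a complete answer on its own, so the choice depends on which names you would rather accept or reject.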

answered Sep 28 '22 12:09

Bohemian

Tags:

java

regex

Janning
