
How to use regular expression to validate Chinese input?

I need to treat the following kind of Chinese input as invalid in client-side validation:

The input is invalid when it mixes English letters, Chinese characters, and spaces and its total length is 10 or more.

For example, "你的a你的a你的a你" and "你的 你的 你的 你" (length 10) are invalid, but "你的a你的a你的a" (length 9) is OK.

I am using JavaScript for client-side validation and Java for server-side validation, so ideally the same regular expression would work in both.

Can anyone give some hints on how to write this rule as a regular expression?

jm li asked Oct 18 '16 04:10


1 Answer

From "What's the complete range for Chinese characters in Unicode?", the CJK Unicode ranges are:

Block                                   Range       Comment
--------------------------------------- ----------- ----------------------------------------------------
CJK Unified Ideographs                  4E00-9FFF   Common
CJK Unified Ideographs Extension A      3400-4DBF   Rare
CJK Unified Ideographs Extension B      20000-2A6DF Rare, historic
CJK Unified Ideographs Extension C      2A700-2B73F Rare, historic
CJK Unified Ideographs Extension D      2B740-2B81F Uncommon, some in current use
CJK Unified Ideographs Extension E      2B820-2CEAF Rare, historic
CJK Compatibility Ideographs            F900-FAFF   Duplicates, unifiable variants, corporate characters
CJK Compatibility Ideographs Supplement 2F800-2FA1F Unifiable variants
CJK Symbols and Punctuation             3000-303F

You probably want to allow code points from the Unicode blocks CJK Unified Ideographs and CJK Unified Ideographs Extension A.

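This regex matches a string of 0 to 9 characters, each of which is a space, an ideographic space (U+3000), an ASCII letter (A-Z or a-z), or a code point in those 2 CJK blocks. Any string of 10 or more characters fails to match, as does any string containing a character outside that set (digits or punctuation, for example).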

/^[ A-Za-z\u3000\u3400-\u4DBF\u4E00-\u9FFF]{0,9}$/
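As a quick check, the pattern can be run against the three sample strings from the question. This is only a sketch; the names `pattern`, `samples`, and `results` are made up for illustration.

```javascript
// Pattern copied from the answer above.
const pattern = /^[ A-Za-z\u3000\u3400-\u4DBF\u4E00-\u9FFF]{0,9}$/;

// Sample inputs taken from the question.
const samples = [
  "你的a你的a你的a你", // 10 characters, mixed with Latin letters
  "你的 你的 你的 你", // 10 characters, mixed with spaces
  "你的a你的a你的a",   // 9 characters
];

// true means the input is accepted (fewer than 10 allowed characters).
const results = samples.map((s) => pattern.test(s));

samples.forEach((s, i) => {
  console.log(JSON.stringify(s), results[i] ? "valid" : "invalid");
});
```

The two 10-character samples are rejected and the 9-character one is accepted. Since Java's `java.util.regex.Pattern` also accepts `\uXXXX` escapes, the same expression (without the surrounding slashes) should be usable on the server side, though that is worth verifying with the same samples.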

The ideographs are listed in:

  • part 1
  • part 2
  • part 3
  • part 4
  • Extension A

You can add more blocks to the character class if you need them.
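Blocks beyond U+FFFF, such as Extension B, cannot be written with four-digit `\uXXXX` escapes. One possible approach in JavaScript, assuming an engine that supports the `u` (Unicode) flag, is to use `\u{...}` escapes. The sketch below adds only Extension B; `extendedPattern` and the other names are hypothetical.

```javascript
// Pattern from the answer (Basic Multilingual Plane blocks only).
const originalPattern = /^[ A-Za-z\u3000\u3400-\u4DBF\u4E00-\u9FFF]{0,9}$/;

// Hypothetical variant that also allows CJK Unified Ideographs Extension B.
// The `u` flag enables \u{...} escapes and makes the quantifier count
// code points instead of UTF-16 code units.
const extendedPattern =
  /^[ A-Za-z\u3000\u3400-\u4DBF\u4E00-\u9FFF\u{20000}-\u{2A6DF}]{0,9}$/u;

const rare = "\u{20000}";        // first code point of Extension B
const nineRare = rare.repeat(9);
const tenRare = rare.repeat(10);

// Each Extension B character occupies two UTF-16 code units,
// so String.prototype.length is twice the code point count.
const codeUnits = nineRare.length;
const codePoints = [...nineRare].length;

const originalAcceptsRare = originalPattern.test(rare);
const extendedAcceptsNine = extendedPattern.test(nineRare);
const extendedAcceptsTen = extendedPattern.test(tenRare);

console.log({ codeUnits, codePoints, originalAcceptsRare,
              extendedAcceptsNine, extendedAcceptsTen });
```

Without the `u` flag, a supplementary character is seen as two surrogate code units, which is why the original pattern rejects it. With the flag, the 9-character limit applies to code points, so any separate length check should count code points as well.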


Code:

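// Returns true when the text has at most 9 characters, each of which is a
// space, an ideographic space, an ASCII letter, or a code point from the
// two CJK blocks above. Despite the name, 10 characters is already invalid.
function has10OrLessCJK(text) {
    return /^[ A-Za-z\u3000\u3400-\u4DBF\u4E00-\u9FFF]{0,9}$/.test(text);
}

// Shows the validation result for the current input value.
function checkValidation(value) {
    var valid = document.getElementById("valid");
    if (has10OrLessCJK(value)) {
        valid.innerText = "Valid";
    } else {
        valid.innerText = "Invalid";
    }
}

<!-- Markup for the demo: the input re-validates on every change -->
<input type="text"
       style="width:100%"
       oninput="checkValidation(this.value)"
       value="你的a你的a你的a">

<div id="valid">
    Valid
</div>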
Mariano answered Oct 10 '22 08:10