
JavaScript regular expression to catch kanji

I can't get this JavaScript function to work the way I want...

// matches a String that contains kanji and/or kana character(s)

String.prototype.isKanjiKana = function(){
    return !!this.match(/^[\u4E00-\u9FAF|\u3040-\u3096|\u30A1-\u30FA|\uFF66-\uFF9D|\u31F0-\u31FF]+$/);
}

It returns true if the string consists entirely of kanji and/or kana characters, and false if alphabetic or other characters are present.

I would like it to return true if at least one kanji or kana character is present, rather than only when every character is one.

Thank you in advance for any help!

Mikele asked Sep 08 '11 07:09


4 Answers

The right answer is not to hardcode ranges. Never ever put magic numbers in your code! That is a maintenance nightmare. It is hard to read, hard to write, hard to debug, hard to maintain. How do you know you got the numbers right? What happens when they add new ones? No, do not use magic numbers. Please.

The right answer is to use named Unicode scripts, which are a fundamental aspect of every Unicode code point:

[\p{Han}\p{Hiragana}\p{Katakana}]

That requires the XRegExp plugin for JavaScript.

The real problem is that JavaScript regexes on their own are too primitive to support Unicode properties, and therefore to support Unicode. Maybe that was an acceptable compromise 15 years ago, but today it is nothing less than intolerably negligent, as you yourself have discovered.

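You will also miss a few code points whose Script is Common but which the newer Script Extensions property assigns to kana, though that probably does not matter. You could add \p{Common} to the set above, but be aware that Common also covers digits, spaces, and most punctuation, so the set would then match far more than kana.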
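The difference between Script and Script Extensions can be checked directly. The following is a minimal sketch, assuming an engine with native ES2018 property escapes rather than XRegExp (so the scripts are spelled `\p{Script=...}` instead of the XRegExp shorthand). The variable names are illustrative only. It uses U+30FC, the prolonged sound mark, whose Script is Common but whose Script Extensions include Hiragana and Katakana.

```javascript
// Assumption: native Unicode property escapes (ES2018), not XRegExp.
// All names below are illustrative and do not come from the answer.
const mark = "\u30FC"; // KATAKANA-HIRAGANA PROLONGED SOUND MARK

// Matches on the Script property only.
const byScript = /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}]/u;

// Matches on the Script_Extensions property (scx is the short alias).
const byScriptExtensions = /[\p{scx=Han}\p{scx=Hiragana}\p{scx=Katakana}]/u;

// Script-based set widened with Common, as the answer suggests.
const withCommon =
  /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Common}]/u;

console.log(byScript.test(mark));           // false: the mark's Script is Common
console.log(byScriptExtensions.test(mark)); // true
console.log(withCommon.test(mark));         // true
console.log(withCommon.test("abc 123"));    // true: the space and digits are Common
```

The last line shows the cost of adding Common: plain ASCII text with a space or a digit now counts as a match.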

tchrist answered Nov 13 '22 05:11


Now that Unicode property escapes are part of the ECMAScript 2018 spec, the following regex can be used natively if the JS engine supports this feature (expanding on @tchrist's answer):

/[\p{Script_Extensions=Han}\p{Script_Extensions=Hiragana}\p{Script_Extensions=Katakana}]/u

If you want to exclude punctuation from being matched:

/(?!\p{Punctuation})[\p{Script_Extensions=Han}\p{Script_Extensions=Hiragana}\p{Script_Extensions=Katakana}]/u
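As a rough sketch of how the two patterns above might be wrapped for the original question, the helper below returns true when at least one matching character is present. The function name `containsKanjiKana` and the `allowPunctuation` option are hypothetical, introduced only for illustration, and the code assumes an engine with ES2018 property escapes.

```javascript
// Hypothetical helper; the name and option are not from the answer.
// Assumes native support for Unicode property escapes (ES2018).
const KANJI_KANA =
  /[\p{Script_Extensions=Han}\p{Script_Extensions=Hiragana}\p{Script_Extensions=Katakana}]/u;

// Same set, but the negative lookahead rejects any punctuation character.
const KANJI_KANA_NO_PUNCT =
  /(?!\p{Punctuation})[\p{Script_Extensions=Han}\p{Script_Extensions=Hiragana}\p{Script_Extensions=Katakana}]/u;

function containsKanjiKana(text, { allowPunctuation = true } = {}) {
  const pattern = allowPunctuation ? KANJI_KANA : KANJI_KANA_NO_PUNCT;
  // No anchors and no quantifier: one matching character is enough.
  return pattern.test(text);
}

console.log(containsKanjiKana("hello 漢字"));   // true
console.log(containsKanjiKana("hello"));        // false
console.log(containsKanjiKana("hello、world")); // true: the ideographic comma has Han in its Script_Extensions
console.log(containsKanjiKana("hello、world", { allowPunctuation: false })); // false
```

A standalone function is used here instead of extending `String.prototype`; that is a design choice of this sketch, not something the answer prescribes.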
Inkling answered Nov 13 '22 03:11


/[\u3000-\u303f]|[\u3040-\u309f]|[\u30a0-\u30ff]|[\uff00-\uffef]|[\u4e00-\u9faf]|[\u3400-\u4dbf]/
  • Japanese style punctuation: [\u3000-\u303f]
  • Hiragana: [\u3040-\u309f]
  • Katakana: [\u30a0-\u30ff]
  • Full-width Roman characters + half-width katakana: [\uff00-\uffef]
  • Kanji: [\u4e00-\u9faf]|[\u3400-\u4dbf]
Anh Nguyen answered Nov 13 '22 04:11


// Returns true if the string contains at least one kanji or kana character
String.prototype.isKanjiKana = function(){
    return !!this.match(/[\u4E00-\u9FAF\u3040-\u3096\u30A1-\u30FA\uFF66-\uFF9D\u31F0-\u31FF]/);
};

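Don't anchor the pattern to the beginning and end of the string with ^ and $, and drop the +, which is unnecessary when a single match is enough. The | characters are removed as well: inside a character class, | is a literal pipe, not alternation.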

xanatos answered Nov 13 '22 04:11