I need to take a string of mixed Asian characters (for now, assume only Chinese hanzi or Japanese kanji/hiragana/katakana) and "alphanumeric" text (e.g., English, French), and count it in the following way:
1) count each Asian CHARACTER as 1; 2) count each Alphanumeric WORD as 1;
a few examples:
株式会社myCompany = 4 chars + 1 word = 5 total
株式会社マイコ = 7 chars
my only idea so far is to use:
var wordArray=val.split(/\w+/);
and then check each element to see if its contents are alphanumeric (so count as 1) or not (so take the array length). But I don't feel that's really very clever at all, and the text being counted might be up to 10,000 words, so it's not very quick.
Ideas?
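Unfortunately JavaScript's RegExp has traditionally had no support for Unicode character classes (property escapes such as \p{Script=Han} only arrived in ES2018, and need the u flag); \w only applies to ASCII characters (modulo some browser bugs).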
You can use Unicode characters in character groups, though, so you can do it if you can isolate each set of characters you are interested in as a range, e.g.:
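var r = new RegExp(
    '[A-Za-z0-9_]+|' +    // Runs of ASCII letters, digits, underscore (no accents)
    '[\u3040-\u309F]+|' + // Runs of hiragana
    '[\u30A0-\u30FF]+|' + // Runs of katakana
    '[\u4E00-\u9FFF\uF900-\uFAFF\u3400-\u4DBF]', // Single CJK ideographs
    'g');
// match() returns null when nothing matches, so fall back to an empty array
var nwords = (str.match(r) || []).length;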
(This attempts to give a more realistic count of ‘words’ for Japanese, counting each run of one type of kana as a word. That's still not right, of course, but it's probably closer than treating each syllable as one word.)
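If you do want the counting rule exactly as stated in the question (every kana counts as 1, so 株式会社マイコ = 7), a minimal sketch would be to fold the kana ranges into the single-character group instead of matching them as runs. The names `mixedPattern` and `countMixed` are made up for illustration, and the ranges are the same rough approximation as in the pattern above:

```javascript
// Hypothetical helper, not from the original answer: counts each CJK
// ideograph or kana as 1 and each run of ASCII word characters as 1.
var mixedPattern = new RegExp(
    '[A-Za-z0-9_]+|' +                 // a run of ASCII letters/digits/underscore = 1 word
    '[\u3040-\u309F\u30A0-\u30FF' +    // a single hiragana or katakana
    '\u4E00-\u9FFF\uF900-\uFAFF\u3400-\u4DBF]', // a single CJK ideograph
    'g');

function countMixed(str) {
    var matches = str.match(mixedPattern); // null when nothing matches
    return matches ? matches.length : 0;
}

console.log(countMixed('株式会社myCompany')); // 5
console.log(countMixed('株式会社マイコ'));     // 7
```

This is a single pass over the string by the regex engine, so a text of around 10,000 words should not be a problem. Accented letters are still not matched by the ASCII group, so a word such as "café" is counted from its ASCII part only.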
Obviously there are many more characters that would have to be accounted for if you wanted to ‘do it properly’. Let's hope you don't have characters outside the Basic Multilingual Plane, for one!
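On engines that implement ES2018, one way to sidestep the hand-written ranges and the Basic Multilingual Plane problem is a Unicode property escape with the `u` flag. This is only a sketch under that assumption; `unicodePattern` and `countMixedUnicode` are hypothetical names, and it counts every kana individually as the question asks. Note that it is not a drop-in equivalent of the ranges above: as far as I know, a few marks inside the kana blocks (the prolonged sound mark ー, for instance) are classified as script Common and so are not matched here.

```javascript
// Assumes an ES2018+ engine: \p{...} property escapes require the `u` flag.
// With `u`, the pattern matches whole code points, so ideographs outside
// the Basic Multilingual Plane count as one character, not two.
var unicodePattern =
    /[A-Za-z0-9_]+|[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}]/gu;

// Hypothetical helper, named for illustration only.
function countMixedUnicode(str) {
    var matches = str.match(unicodePattern); // null when nothing matches
    return matches ? matches.length : 0;
}

console.log(countMixedUnicode('株式会社myCompany')); // 5
console.log(countMixedUnicode('\u{2000B}'));          // 1 (a CJK Extension B ideograph)
```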