
How to count words in a mixture of English and Chinese in JavaScript

I want to count the number of words in a passage that contains both English and Chinese. For English, it's simple. Each word is a word. For Chinese, we count each character as a word. Therefore, 香港人 is three words here.

So for example, "I am a 香港人" should have a word count of 6.

Any idea how I can count this in JavaScript/jQuery?

Thanks!

user2335065 asked Dec 05 '13 09:12


3 Answers

Try a regex like this:

/[\u00ff-\uffff]|\S+/g

For example, "I am a 香港人".match(/[\u00ff-\uffff]|\S+/g) gives:

["I", "am", "a", "香", "港", "人"]

Then you can just check the length of the resulting array.

The \u00ff-\uffff part of the regex is a Unicode character range; you probably want to narrow it down to just the characters you want to count as individual words. For example, CJK Unified Ideographs would be \u4e00-\u9fcc.

function countWords(str) {
    var matches = str.match(/[\u00ff-\uffff]|\S+/g);
    return matches ? matches.length : 0;
}
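
As a rough sketch of the narrowing suggested above, the version below swaps the broad range for \u4e00-\u9fff, the full CJK Unified Ideographs block. The name countWordsCjk is made up for this example, and the choice of range is an assumption; other scripts (kana, hangul, CJK punctuation) would need their own ranges.

```javascript
// Hypothetical variant of countWords: only characters in the CJK Unified
// Ideographs block (U+4E00 to U+9FFF) are counted one by one.
// Everything else is counted per run of non-whitespace characters.
function countWordsCjk(str) {
    var matches = str.match(/[\u4e00-\u9fff]|\S+/g);
    return matches ? matches.length : 0;
}

console.log(countWordsCjk("I am a 香港人")); // 6
console.log(countWordsCjk("Łódź"));         // 1
console.log(countWordsCjk(""));             // 0
```

With the broad \u00ff-\uffff range, a word such as "Łódź" would be split, because Ł falls inside that range; the narrower range keeps it as one word.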
Dagg Nabbit answered Sep 21 '22 03:09


It can't be 6, because the length of a string includes the spaces too. So,

var d = "I am a 香港人";
d.length //returns 10
d.replace(/\s+/g, "").length  //returns 7, excluding spaces

FYI: your site should use a proper character encoding.

I think I found what you need. "I am a 香港人" contains the letter a twice, so with the help of @PSL's answer I found a way: strip the spaces and count only the unique characters.

var d = "I am a 香港人";
var uniqueList=d.replace(/\s+/g, '').split('').filter(function(item,i,allItems){
    return i==allItems.indexOf(item);
}).join('');
console.log(uniqueList.length);  //returns 6

Going by your comments, I assume your sentence is "I am a 香 港 人", with a space between each word. I have altered the code accordingly:

var d = "I am a 香 港 人";

var uniqueList=d.split(' ').filter(function(item,i,allItems){
    return i==allItems.indexOf(item);
});
console.log(uniqueList.length);  //returns 6

Praveen answered Sep 23 '22 03:09


I have tried the script, but it sometimes miscounts. For example, some people type "香港人computing都不錯的" with no spaces, and the following script counts it as 4 words instead of 8:

<script>
var str = "香港人computing都不錯的";

// Count each character in \u00ff-\uffff, or each run of non-whitespace, as one word
var matches = str.match(/[\u00ff-\uffff]|\S+/g);
var x = matches ? matches.length : 0;
alert(x); // 4
</script>
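
To see where the 4 comes from, it helps to print the matches themselves. Once \S+ starts at the letter c, it runs on through the Chinese characters that follow, since they are non-whitespace too. The variable name tokens is only for this illustration.

```javascript
// Same regex as above, but inspecting the matched pieces instead of counting them
var tokens = "香港人computing都不錯的".match(/[\u00ff-\uffff]|\S+/g);

console.log(tokens);        // ["香", "港", "人", "computing都不錯的"]
console.log(tokens.length); // 4
```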

To fix the problem, I have changed the code to:

<script>
var str = "香港人computing都不錯的";

// Treat special characters such as the middle dot (U+007F to U+00FE) as separators
str = str.replace(/[\u007F-\u00FE]/g, ' ');

// Replace every run of non-ASCII (Chinese) characters with a space,
// leaving only the English words
var english = str.replace(/[^!-~\d\s]+/gi, ' ');

// Remove all ASCII characters and whitespace, leaving only the Chinese characters
var chinese = str.replace(/[!-~\d\s]+/gi, '');

// English: one match per word. Chinese: one match per character.
var englishMatches = english.match(/[\u00ff-\uffff]|\S+/g);
var chineseMatches = chinese.match(/[\u00ff-\uffff]|\S+/g);

var englishCount = englishMatches ? englishMatches.length : 0;
var chineseCount = chineseMatches ? chineseMatches.length : 0;

// The total for the mixed string
var total = englishCount + chineseCount;

alert(total); // 8
</script>

Now the script counts the words in a mixture of Chinese and English correctly. Enjoy!
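
The same count can probably be reached in a single pass. The sketch below is not the script above: it assumes that only characters in \u4e00-\u9fff should be counted individually, and the name countMixedWords is invented for the example. The second alternative excludes those characters from a run, so an English word ends as soon as a Chinese character begins.

```javascript
// Hypothetical single-regex counter for mixed English and Chinese text.
// Alternative 1: one CJK Unified Ideograph (U+4E00 to U+9FFF) is one word.
// Alternative 2: a run of characters that are neither whitespace nor CJK is one word.
function countMixedWords(str) {
    var matches = str.match(/[\u4e00-\u9fff]|[^\s\u4e00-\u9fff]+/g);
    return matches ? matches.length : 0;
}

console.log(countMixedWords("香港人computing都不錯的")); // 8
console.log(countMixedWords("I am a 香港人"));           // 6
```

Unlike the two-pass script, this version does not treat characters such as the middle dot as separators, so it may give different totals on text that contains them.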

Ken Lee answered Sep 23 '22 03:09