
Count number of characters present in foreign language

Is there an efficient way to count characters in non-English text? For example, the English word "Mother" has 6 letters. The same word typed in Tamil (மதர்) has three letters (ம + த + ர்), but the system treats the last letter (ர்) as two characters (ர + ் = ர்). So is there any way to count the number of real characters?

One clue: if we move the cursor through the word (மதர்) with the keyboard, it steps through only 3 letters, not the 4 characters the system counts. Is there a way to use this to find a solution? Any help would be greatly appreciated.

Stranger asked Dec 11 '12 07:12




1 Answer

Update

Back from lunch =) I'm afraid the previous approach won't work well with every foreign language, so I added another fiddle with a possible way:

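// Placeholder: the full array holds all 1280 escaped Unicode nonspacing marks (category Mn)
var UnicodeNsm = [/* 1280 entries, omitted here */];
function countNSMString(str) {
    var chars = str.split("");
    var count = 0;
    for (var i = 0, ilen = chars.length; i < ilen; i++) {
        // Count only characters that are not in the nonspacing-mark list
        if (UnicodeNsm.indexOf(escape(chars[i])) == -1) {
            count++;
        }
    }
    return count;
}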

var English = "Mother";  
var Tamil = "மதர்";
var Vietnamese = "mẹ"
var Hindi = "मां"

function logL(str) {
    console.log(str + " has " + countNSMString(str) + " visible Characters and " + str.length + " normal Characters"); //"மதர் has 3 visible Characters"
}

logL(English) //"Mother has 6 visible Characters and 6 normal Characters"
logL(Tamil) //"மதர் has 3 visible Characters and 4 normal Characters"
logL(Vietnamese) //"mẹ has 2 visible Characters and 3 normal Characters"
logL(Hindi) //"मां has 1 visible Characters and 3 normal Characters"

So this just checks whether each character in the string is a Unicode nonspacing mark (NSM) and leaves it out of the count. This should work for most languages, not only Tamil, and an array with 1280 elements shouldn't be much of a performance issue.

Here is a list of the Unicode NSMs: http://www.fileformat.info/info/unicode/category/Mn/list.htm

Here is the corresponding JSBin
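The nonspacing-mark check above can also be sketched without a hand-built array, assuming an engine that supports Unicode property escapes in regular expressions (ES2018). The function name countWithoutMarks is made up for this illustration and is not part of the original answer; treat it as a rough equivalent, not a drop-in replacement.

```javascript
// Sketch only: countWithoutMarks is a hypothetical helper name.
// Assumes the engine supports \p{...} property escapes (requires the u flag).
function countWithoutMarks(str) {
    // Strip every character in Unicode category Mn (Nonspacing Mark),
    // then count the remaining code points rather than UTF-16 code units.
    return Array.from(str.replace(/\p{Mn}/gu, "")).length;
}

var tamilWord = "\u0BAE\u0BA4\u0BB0\u0BCD"; // மதர், written with escapes

console.log(countWithoutMarks("Mother"));  // 6
console.log(countWithoutMarks(tamilWord)); // 3
console.log(tamilWord.length);             // 4
```

Note that spacing marks (category Mc), such as the Devanagari vowel sign U+093E, are not part of Mn, so this sketch still counts them. Swapping \p{Mn} for \p{M} would cover all mark categories.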


After experimenting a bit with string operations, it turns out that String.indexOf returns the same index for "ர்" and for "ர", meaning:

"ர்ரர".indexOf("ர்") == "ர்ரர".indexOf("ர" + "்") //true but
"ர்ரர".indexOf("ர") == "ர்ரர".indexOf("ர" + "ர") //false

I took this opportunity and tried something like this:

//ர்

var char = "ரர்ர்ரர்்";
var char2 = "ரரர்ர்ரர்்";    
var char3 = "ர்ரர்ர்ரர்்";

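function countStr(str) {
    var chars = str.split("");
    var count = 0;
    for (var i = 0, ilen = chars.length; i < ilen; i++) {
        // Pair the current code unit with the one after it
        var chars2 = chars[i] + chars[i + 1];
        // If the pair first occurs where the single unit first occurs,
        // treat both units as one character and skip the second
        if (str.indexOf(chars[i]) == str.indexOf(chars2)) {
            i += 1;
        }
        count++;
    }
    return count;
}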


console.log("--");

console.log(countStr(char)); //6

console.log(countStr(char2)); //7

console.log(countStr(char3)); //7

This seems to work for the strings above. It may need some adjustments, as I don't know much about encodings, but maybe it's a point you can start from.

Here's the JSBin
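For comparison with both approaches above, newer JavaScript engines ship Intl.Segmenter, which splits a string into grapheme clusters, roughly the units a cursor steps over. The sketch below assumes an engine that supports Intl.Segmenter; the function name countGraphemes is invented for this example.

```javascript
// Sketch only: countGraphemes is a hypothetical helper name.
// Assumes Intl.Segmenter is available in the running engine.
function countGraphemes(str) {
    // granularity "grapheme" yields one segment per user-perceived character
    var segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
    return Array.from(segmenter.segment(str)).length;
}

var tamilText = "\u0BAE\u0BA4\u0BB0\u0BCD"; // மதர்
var hindiText = "\u092E\u093E\u0902";       // मां

console.log(countGraphemes("Mother"));  // 6
console.log(countGraphemes(tamilText)); // 3
console.log(countGraphemes(hindiText)); // 1
```

Because the segmentation rules come from the engine's Unicode data, no mark list has to be maintained by hand, at the cost of depending on engine support.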

Moritz Roessler answered Oct 05 '22 23:10
