 

Split Kannada word into syllabic clusters

Is there a way to split a Kannada word into its syllabic clusters using JavaScript?

For example, I want to split the word ಕನ್ನಡ into the syllabic clusters ["ಕ", "ನ್ನ", "ಡ"]. But when I split it with split, I get ["ಕ", "ನ", "್", "ನ", "ಡ"] instead.


mpsbhat asked Jun 01 '17 12:06



1 Answer

I can't say this is a complete solution, but it works to an extent, given a basic understanding of how words are formed:

var k = 'ಕನ್ನಡ';
var arr = [];
for (var i = 0; i < k.length; i++) {
  var s = k.charAt(i);

  // Keep appending while there is a next char and either it is not a
  // swara/vyanjana (outside 0xC85-0xCB9) or the char just appended is a virama (0xCCD).
  // The extra parentheses matter: && binds tighter than ||.
  while ((i + 1) < k.length &&
         (k.charCodeAt(i + 1) < 0xC85 || k.charCodeAt(i + 1) > 0xCB9 || k.charCodeAt(i) == 0xCCD)) {
    s += k.charAt(i + 1);
    i++;
  }
  arr.push(s);
}
console.log(arr);

As the comments in the code say, we keep appending characters to the current cluster as long as the next character is not a swara (vowel) or vyanjana (consonant), or the character just appended is a virama. You will have to try it with different words to make sure all the cases you care about are covered. This particular version doesn't handle numbers.
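To make it easier to try different words, the same rule can be wrapped in a function. This is only a sketch of the approach described above, not a complete syllabifier; the name splitKannadaSyllables is made up for illustration, and the ranges are the same ones used in the loop (0xC85-0xCB9 for swara/vyanjana, 0xCCD for the virama).

```javascript
// Hypothetical helper: the name and structure are assumptions for illustration.
function splitKannadaSyllables(word) {
  var VIRAMA = 0x0CCD;
  var clusters = [];
  for (var i = 0; i < word.length; i++) {
    var s = word.charAt(i);
    while (i + 1 < word.length) {
      var next = word.charCodeAt(i + 1);
      // Independent vowels and consonants normally start a new cluster...
      var isLetter = next >= 0x0C85 && next <= 0x0CB9;
      // ...unless the character just appended is a virama (conjunct).
      var afterVirama = word.charCodeAt(i) === VIRAMA;
      if (isLetter && !afterVirama) break;
      s += word.charAt(i + 1);
      i++;
    }
    clusters.push(s);
  }
  return clusters;
}

console.log(splitKannadaSyllables('ಕನ್ನಡ'));
// → ["ಕ", "ನ್ನ", "ಡ"]
```

Note that anything outside the 0xC85-0xCB9 range is appended to the current cluster, so spaces, punctuation and digits end up attached to the preceding syllable. Split the text into words first if that matters for your input.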

For the character codes, you can refer to the Unicode chart for Kannada: http://www.unicode.org/charts/PDF/U0C80.pdf
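When a word doesn't split the way you expect, it helps to print its code points and look them up in that chart. A small sketch (the codePoints helper is a made-up name, not part of any library):

```javascript
// Hypothetical helper: lists each character of a word as a U+XXXX code point.
function codePoints(word) {
  return Array.from(word).map(function (ch) {
    var hex = ch.codePointAt(0).toString(16).toUpperCase();
    return 'U+' + hex.padStart(4, '0');
  });
}

console.log(codePoints('ಕನ್ನಡ'));
// → ["U+0C95", "U+0CA8", "U+0CCD", "U+0CA8", "U+0CA1"]
```

Here U+0CCD is the virama that joins the two ನ (U+0CA8) into one cluster.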

bugs_cena answered Nov 02 '22 06:11
