
Converting a Unicode string to Unicode characters in C# for Indian languages

I need to split a Unicode string into its individual Unicode characters.

For example, in Tamil:

"கமலி"=>'க','ம','லி'

I'm able to get the Unicode bytes, but producing the characters as a reader sees them has become a problem.

byte[] stringBytes = Encoding.Unicode.GetBytes("கமலி");
char[] stringChars = Encoding.Unicode.GetChars(stringBytes);
// Each char is one UTF-16 code unit, so the vowel sign is printed separately.
foreach (var crt in stringChars)
{
    Trace.WriteLine(crt);
}

It gives this result:

'க'=>0x0b95

'ம'=>0x0bae

'ல'=>0x0bb2

'ி'=>0x0bbf

So the problem is how to extract 'லி' as a single unit, without splitting it into 'ல' and 'ி'.

In Indian languages it is natural to treat a consonant together with its vowel sign as a single character, but parsing char by char in C# makes this difficult.

All I need is for the string to be split into 3 characters.

Arunkumar Chandrasekaran asked Dec 20 '12 06:12



1 Answer

To iterate over graphemes you can use the methods of the StringInfo class.

Each combination of base character + combining characters is called a 'text element' by the .NET documentation, and you can iterate over them using a TextElementEnumerator:

var str = "கமலி";
var enumerator = System.Globalization.StringInfo.GetTextElementEnumerator(str);
while (enumerator.MoveNext())
{
    Console.WriteLine(enumerator.Current);
}

Output:

க
ம
லி
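
If the pieces are needed as a collection rather than printed one by one, the same enumerator can fill a list. This is only a minimal sketch: the class name TextElementDemo and the helper SplitTextElements are made up for illustration, while GetTextElementEnumerator, GetTextElement and LengthInTextElements are existing members of System.Globalization.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TextElementDemo
{
    // Hypothetical helper (not part of .NET): returns every text element
    // (base character plus its combining marks) as a separate string.
    public static List<string> SplitTextElements(string text)
    {
        var elements = new List<string>();
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(text);
        while (enumerator.MoveNext())
        {
            // GetTextElement() returns the current element typed as string.
            elements.Add(enumerator.GetTextElement());
        }
        return elements;
    }

    public static void Main()
    {
        List<string> parts = SplitTextElements("கமலி");
        Console.WriteLine(parts.Count);                // 3
        Console.WriteLine(string.Join(" | ", parts));

        // StringInfo can also report the count without enumerating by hand.
        Console.WriteLine(new StringInfo("கமலி").LengthInTextElements);
    }
}
```

Note that what counts as one text element for longer clusters (for example conjuncts joined by a virama) may differ between .NET versions, so it is worth checking the results on the runtime you target.
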
porges answered Oct 15 '22 06:10
