 

Counting special UTF-8 characters

Tags:

c#

I'm looking for a way to count special characters that are formed from more than one character, but I have found no solution online!

For example, I want to count the characters in the string "வாழைப்பழம". It actually consists of 6 Tamil characters, but the length comes out as 9 when measured the normal way. I am wondering whether Tamil is the only script that causes this problem, and whether there is a solution. I'm currently trying to find one in C#.

Thank you in advance =)

Cheng asked Jun 15 '12 16:06




2 Answers

Use StringInfo.LengthInTextElements:

using System;
using System.Globalization;

var text = "வாழைப்பழம";
Console.WriteLine(text.Length);                               // 9 (UTF-16 Char values)
Console.WriteLine(new StringInfo(text).LengthInTextElements); // 6 (text elements)

The explanation for this behaviour can be found in the documentation of String.Length:

The Length property returns the number of Char objects in this instance, not the number of Unicode characters. The reason is that a Unicode character might be represented by more than one Char. Use the System.Globalization.StringInfo class to work with each Unicode character instead of each Char.
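To see which Char values get grouped together, the text elements can also be enumerated one by one. The following is only a minimal sketch: the class name TextElementDemo and the helper SplitTextElements are made up for illustration, while StringInfo.GetTextElementEnumerator and TextElementEnumerator.GetTextElement are the framework calls doing the work.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TextElementDemo
{
    // Hypothetical helper: splits a string into its text elements
    // (a base character together with any combining marks that follow it).
    public static List<string> SplitTextElements(string text)
    {
        var elements = new List<string>();
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(text);
        while (enumerator.MoveNext())
        {
            elements.Add(enumerator.GetTextElement());
        }
        return elements;
    }

    public static void Main()
    {
        foreach (string element in SplitTextElements("வாழைப்பழம"))
        {
            // element.Length is the number of UTF-16 Char values in this text element.
            Console.WriteLine($"{element}: {element.Length} Char value(s)");
        }
    }
}
```

For the string from the question, three of the six elements should come out with two Char values each, which accounts for the difference between 9 and 6. Depending on the console, Console.OutputEncoding may need to be set to UTF-8 for the Tamil text to display properly.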

Heinzi answered Sep 20 '22 09:09



A minor nitpick: strings in .NET use UTF-16, not UTF-8.


When you're talking about the length of a string, there are several different things you could mean:

  1. Length in bytes. This is usually the old C way of looking at things.
  2. Length in Unicode code points. This is closer to modern practice and arguably how string lengths should be treated, but in practice it usually isn't.
  3. Length in UTF-8/UTF-16 code units. This is the most common interpretation and derives from 1. Certain characters take more than one code unit in those encodings, which complicates things if you don't expect it.
  4. Count of visible “characters” (graphemes). This is usually what people mean when they talk about the characters in a string or its length.
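The four interpretations above can be computed side by side. This is a rough sketch rather than a definitive implementation: the class StringLengths and its method names are hypothetical, the byte count assumes UTF-8 as the encoding (other encodings give other numbers), and the grapheme count relies on StringInfo's notion of a text element.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class StringLengths
{
    // 1. Length in bytes, assuming the string is encoded as UTF-8.
    public static int Utf8Bytes(string s) => Encoding.UTF8.GetByteCount(s);

    // 2. Length in Unicode code points.
    public static int CodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            // A surrogate pair is two Char values but a single code point.
            if (char.IsSurrogatePair(s, i)) i++;
            count++;
        }
        return count;
    }

    // 3. Length in UTF-16 code units, which is what String.Length reports.
    public static int Utf16CodeUnits(string s) => s.Length;

    // 4. Count of text elements (graphemes as StringInfo sees them).
    public static int Graphemes(string s) => new StringInfo(s).LengthInTextElements;

    public static void Main()
    {
        string text = "வாழைப்பழம";
        Console.WriteLine(Utf8Bytes(text));      // 27
        Console.WriteLine(CodePoints(text));     // 9
        Console.WriteLine(Utf16CodeUnits(text)); // 9
        Console.WriteLine(Graphemes(text));      // 6
    }
}
```

Each Tamil code point takes three bytes in UTF-8, hence 27 bytes for 9 code points.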

In your case the confusion stems from the difference between 3 and 4: 3 is what C# reports, 4 is what you expect. Complex scripts such as Tamil use ligatures and diacritics. Ligatures are contractions of two or more adjacent characters into a single glyph. In your case ழை is a ligature of ழ and ை, the latter of which changes the appearance of the former; வா is also such a ligature. (In Unicode terms, ை and ா are combining vowel signs attached to the preceding consonant.) Diacritics are ornaments around a letter, e.g. the accent in à or the dot above ப்.

Both kinds of combination result in a single grapheme (what you perceive as a single character), yet each needs two code points. Your string contains three such combinations (வா, ழை and ப்), so it has three more code points than graphemes: 9 instead of 6.

One thing to note: in your case the distinction between 2 and 3 is irrelevant, because every character in your string takes exactly one UTF-16 code unit, but in general you should keep it in mind.
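A character outside the Basic Multilingual Plane shows where 2 and 3 part ways. The sketch below uses U+1F600 as an arbitrary example; the class SurrogateDemo and the helper CountCodePoints are hypothetical names, and the helper assumes the string is well-formed UTF-16 (no unpaired surrogates).

```csharp
using System;
using System.Linq;

public static class SurrogateDemo
{
    // Hypothetical helper: counts code points by skipping the second half
    // of every surrogate pair. Assumes well-formed UTF-16 input.
    public static int CountCodePoints(string s) => s.Count(c => !char.IsLowSurrogate(c));

    public static void Main()
    {
        string emoji = "\U0001F600"; // one code point outside the BMP
        Console.WriteLine(emoji.Length);           // 2 (UTF-16 code units)
        Console.WriteLine(CountCodePoints(emoji)); // 1 (code point)

        string tamil = "வாழைப்பழம"; // every code point fits in one code unit
        Console.WriteLine(tamil.Length);           // 9
        Console.WriteLine(CountCodePoints(tamil)); // 9
    }
}
```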

Joey answered Sep 21 '22 09:09
