I'm looking for a way to count special characters that are formed from more than one character, but I found no solution online!
For example, I want to count the characters in the string "வாழைப்பழம". It actually consists of 6 Tamil characters, but the normal way of finding the length reports 9. I am wondering whether Tamil is the only kind of encoding that causes this problem and whether there is a solution to it. I'm currently trying to find a solution in C#.
Thank you in advance =)
This utf8len() function provides a portable (and small-footprint) way of counting UTF-8 characters in standard C or C++. The test source code contains UTF-8 characters, so check that the source file doesn't get corrupted when copying and pasting the code.
UTF-8 is a variable-width character encoding used for electronic communication. Defined by the Unicode Standard, the name is derived from Unicode (or Universal Coded Character Set) Transformation Format – 8-bit. UTF-8 is capable of encoding all 1,112,064 valid character code points in Unicode using one to four one-byte (8-bit) code units.
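Since the question is about C#, here is a minimal sketch of the variable-width idea in that language. It assumes well-formed UTF-8 input and counts code points by skipping continuation bytes (those of the form 10xxxxxx). The helper name CountUtf8CodePoints is made up for this illustration and is not the utf8len() function mentioned above.

```csharp
using System;
using System.Text;

// Hypothetical helper: counts code points in well-formed UTF-8 by counting
// every byte that is NOT a continuation byte (bit pattern 10xxxxxx).
static int CountUtf8CodePoints(byte[] utf8)
{
    int count = 0;
    foreach (byte b in utf8)
    {
        if ((b & 0xC0) != 0x80)
        {
            count++;
        }
    }
    return count;
}

// "a" takes 1 byte, U+00E9 takes 2 bytes, U+20AC takes 3 bytes.
byte[] sample = Encoding.UTF8.GetBytes("a\u00E9\u20AC");
Console.WriteLine(sample.Length);               // 6
Console.WriteLine(CountUtf8CodePoints(sample)); // 3
```

Note that this counts code points, not the user-perceived characters the question asks about; a Tamil letter followed by a vowel sign still counts as two.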
Overlong encodings. Modified UTF-8 uses the two-byte overlong encoding of U+0000 (the NUL character), 11000000 10000000 (hexadecimal C0 80), instead of 00000000 (hexadecimal 00). This allows the byte 00 to be used as a string terminator.
The average times measured for the functions include a small overhead from the for loop.
Use StringInfo.LengthInTextElements:
using System;
using System.Globalization;

var text = "வாழைப்பழம";
Console.WriteLine(text.Length); // 9 (UTF-16 Char values)
Console.WriteLine(new StringInfo(text).LengthInTextElements); // 6 (text elements)
The explanation for this behaviour can be found in the documentation of String.Length:
The Length property returns the number of Char objects in this instance, not the number of Unicode characters. The reason is that a Unicode character might be represented by more than one Char. Use the System.Globalization.StringInfo class to work with each Unicode character instead of each Char.
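To see which parts of the string StringInfo groups together, the text elements can be listed one by one. The following is a small sketch using StringInfo.GetTextElementEnumerator; the helper name SplitTextElements is an assumption made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper: splits a string into its text elements
// (base character plus any combining marks attached to it).
static List<string> SplitTextElements(string text)
{
    var elements = new List<string>();
    TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(text);
    while (enumerator.MoveNext())
    {
        elements.Add(enumerator.GetTextElement());
    }
    return elements;
}

// Print each text element together with the number of Char values it uses.
foreach (string element in SplitTextElements("வாழைப்பழம"))
{
    Console.WriteLine($"{element}: {element.Length} Char");
}
```

For the string from the question, three of the six elements should consist of two Char values each, which accounts for the difference between 9 and 6.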
A minor nitpick: strings in .NET use UTF-16, not UTF-8.
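When you're talking about the length of a string, there are several different things you could mean:
1. Length in bytes.
2. Length in Unicode code points.
3. Length in UTF-16 code units (Char values in .NET).
4. Length in visible "characters" (graphemes).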
In your case, the confusion stems from the difference between 4. and 3.: 3. is what C# uses, 4. is what you expect. Complex scripts such as Tamil use ligatures and diacritics. Ligatures are contractions of two or more adjacent characters into a single glyph. In your case, ழை is a ligature of ழ and ை, the latter of which changes the appearance of the former; வா is also such a ligature. Diacritics are ornaments around a letter, e.g. the accent in à or the dot above ப்.
The two cases I mentioned both result in a single grapheme (what you perceive as a single character), yet they both need two actual characters each. So you end up with three code points more in the string.
One thing to note: For your case the distinction between 2. and 3. is irrelevant, but generally you should keep it in mind.
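As a rough illustration of these different notions of length, the sketch below prints the UTF-8 byte count, the UTF-16 code unit count, the code point count, and the grapheme (text element) count for two strings. It assumes the numbered measures refer to those four quantities; the helpers CountCodePoints and Report are made up for this example. The emoji U+1F600 lies outside the Basic Multilingual Plane, so it shows a case where code units and code points differ.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper: counts Unicode code points in a UTF-16 string.
static int CountCodePoints(string s)
{
    int count = 0;
    for (int i = 0; i < s.Length; i++)
    {
        // A high surrogate followed by a low surrogate forms one code point.
        if (char.IsHighSurrogate(s[i]) && i + 1 < s.Length && char.IsLowSurrogate(s[i + 1]))
        {
            i++;
        }
        count++;
    }
    return count;
}

// Hypothetical helper: prints the four length measures for one string.
static void Report(string s)
{
    Console.WriteLine($"UTF-8 bytes:       {Encoding.UTF8.GetByteCount(s)}");
    Console.WriteLine($"UTF-16 code units: {s.Length}");
    Console.WriteLine($"code points:       {CountCodePoints(s)}");
    Console.WriteLine($"graphemes:         {new StringInfo(s).LengthInTextElements}");
}

Report("வாழைப்பழம"); // 27 bytes, 9 code units, 9 code points, 6 graphemes
Report("\U0001F600");  // 4 bytes, 2 code units, 1 code point, 1 grapheme
```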