Counting special UTF-8 characters

Tags:

c#

I'm looking for a way to count special characters that are formed from more than one character, but I have found no solution online.

For example, I want to count the characters in the string "வாழைப்பழம". It actually consists of 6 Tamil characters, but the normal way of finding the length reports 9. I am wondering whether Tamil is the only script that causes this problem and whether there is a solution to it. I'm currently trying to find a solution in C#.

Thank you in advance =)

540

asked Jun 15 '12 16:06

Cheng

2 Answers

Use StringInfo.LengthInTextElements:

using System;
using System.Globalization;

var text = "வாழைப்பழம";
Console.WriteLine(text.Length);                               // 9
Console.WriteLine(new StringInfo(text).LengthInTextElements); // 6

The explanation for this behaviour can be found in the documentation of String.Length:

The Length property returns the number of Char objects in this instance, not the number of Unicode characters. The reason is that a Unicode character might be represented by more than one Char. Use the System.Globalization.StringInfo class to work with each Unicode character instead of each Char.
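
To see which Char values end up grouped together, the same StringInfo class can also enumerate the text elements one by one. The following is only a sketch: the class and method names (TextElementDemo, SplitTextElements) are made up for illustration, and only StringInfo.GetTextElementEnumerator and TextElementEnumerator.GetTextElement come from the framework.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class TextElementDemo
{
    // Hypothetical helper: returns each text element of the input as its own string.
    public static List<string> SplitTextElements(string text)
    {
        var elements = new List<string>();
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(text);
        while (enumerator.MoveNext())
        {
            elements.Add(enumerator.GetTextElement());
        }
        return elements;
    }

    static void Main()
    {
        foreach (string element in SplitTextElements("வாழைப்பழம"))
        {
            // element.Length is the number of UTF-16 Char values in this text element.
            Console.WriteLine($"{element}: {element.Length} Char");
        }
    }
}
```

For the string from the question this yields six elements, three of which (வா, ழை and ப்) consist of two Char values each.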

146

answered Sep 20 '22 09:09

Heinzi

A minor nitpick: strings in .NET use UTF-16, not UTF-8.

When you're talking about the length of a string, there are several different things you could mean:

1. Length in bytes. This is usually the old C way of looking at things.
2. Length in Unicode code points. This is closer to modern practice and arguably how string lengths should be treated, but in practice it usually isn't.
3. Length in UTF-8/UTF-16 code units. This is the most common interpretation, deriving from 1. Certain characters take more than one code unit in those encodings, which complicates things if you don't expect it.
4. Count of visible “characters” (graphemes). This is usually what people mean when they say "characters" or "length of a string".
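
As a rough illustration, the four measures can be computed side by side in C#. This is a sketch, not the only way to do it: the class and method names are invented for this example, UTF-8 is just one possible choice of byte encoding, and the code point count assumes the string contains no unpaired surrogates.

```csharp
using System;
using System.Globalization;
using System.Text;

static class StringLengths
{
    // Length in bytes, here for UTF-8 (other encodings give other values).
    public static int Utf8Bytes(string s) => Encoding.UTF8.GetByteCount(s);

    // Length in Unicode code points: a surrogate pair counts once.
    public static int CodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsSurrogatePair(s, i)) i++; // skip the low surrogate
            count++;
        }
        return count;
    }

    // Length in UTF-16 code units, which is what String.Length reports.
    public static int Utf16CodeUnits(string s) => s.Length;

    // Count of text elements, the closest built-in approximation of graphemes.
    public static int TextElements(string s) => new StringInfo(s).LengthInTextElements;

    static void Main()
    {
        string text = "வாழைப்பழம";
        Console.WriteLine(Utf8Bytes(text));
        Console.WriteLine(CodePoints(text));
        Console.WriteLine(Utf16CodeUnits(text));
        Console.WriteLine(TextElements(text));
    }
}
```

For the string from the question these helpers give 27, 9, 9 and 6, in that order.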

In your case, the confusion stems from the difference between 4. and 3.: 3. is what C# uses, 4. is what you expect. Complex scripts such as Tamil use ligatures and diacritics. Ligatures are contractions of two or more adjacent characters into a single glyph. In your case ழை is a ligature of ழ and ை, the latter of which changes the appearance of the former; வா is also such a ligature. Diacritics are ornaments around a letter, e.g. the accent in à or the dot above ப்.

The two cases I mentioned both result in a single grapheme (what you perceive as a single character), yet each needs two actual characters (code points). Your string contains three such pairs (வா, ழை and ப்), so it has three more code points than graphemes: 9 instead of 6.

One thing to note: for your string the distinction between 2. and 3. is irrelevant, because every Tamil character in it fits in a single UTF-16 code unit, but generally you should keep it in mind.
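
To show where 2. and 3. do diverge, here is a small sketch using a character outside the Basic Multilingual Plane (U+1F600, chosen arbitrarily for this example). The class and method names are hypothetical, and the helper assumes well-formed input without unpaired surrogates.

```csharp
using System;
using System.Collections.Generic;

static class CodePointDemo
{
    // Hypothetical helper: lists the Unicode code points of a string,
    // combining each surrogate pair into a single value.
    public static List<int> ToCodePoints(string s)
    {
        var result = new List<int>();
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsSurrogatePair(s, i))
            {
                result.Add(char.ConvertToUtf32(s, i));
                i++; // the low surrogate has been consumed as well
            }
            else
            {
                result.Add(s[i]);
            }
        }
        return result;
    }

    static void Main()
    {
        string text = "a\U0001F600"; // 'a' followed by one supplementary character
        Console.WriteLine(text.Length);             // 3 UTF-16 code units
        Console.WriteLine(ToCodePoints(text).Count); // 2 code points
    }
}
```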

answered Sep 21 '22 09:09

Joey
