
Why is the length of this string longer than the number of characters in it?



Everyone else is giving the surface answer, but there's a deeper rationale too: the number of "characters" in a string is difficult to define and can be surprisingly expensive to compute, whereas a length property should be fast.

Why is it difficult to define? Well, there are a few options, and none is really more valid than the others:

  • The number of code units (bytes or other fixed-size data chunks; C# and Windows typically use UTF-16, so Length returns the number of two-byte pieces) is certainly relevant, as the computer still needs to deal with the data in that form for many purposes (writing to a file, for example, cares about bytes rather than characters).

  • The number of Unicode codepoints is fairly easy to compute (although O(n), because you have to scan the string for surrogate pairs) and might matter to a text editor... but it isn't actually the same thing as the number of characters printed on screen (called graphemes). For example, some accented letters can be represented in two forms: a single codepoint, or two codepoints paired together, one representing the letter and one saying "add an accent to my partner letter". Would the pair be two characters or one? You can normalize strings to help with this, but not all valid letters have a single-codepoint representation (see the sketch after this list).

  • Even the number of graphemes isn't the same as the printed length of a string, which depends on the font among other factors, and since some characters are printed with some overlap in many fonts (kerning), the on-screen length of a string is not necessarily equal to the sum of the lengths of its graphemes anyway!

  • Some Unicode codepoints aren't even characters in the traditional sense, but rather some kind of control marker, like a byte order mark or a right-to-left indicator. Do these count?
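To make the codepoint-counting point concrete, here's a minimal sketch (the helper name CountCodePoints is my own, not part of the framework) that walks the UTF-16 code units and collapses surrogate pairs, which is exactly why the count is O(n):

int CountCodePoints(string s)
{
    int count = 0;
    for (int i = 0; i < s.Length; i++)
    {
        count++;
        // A high surrogate followed by a low surrogate is one codepoint.
        if (char.IsHighSurrogate(s[i]) && i + 1 < s.Length && char.IsLowSurrogate(s[i + 1]))
            i++;
    }
    return count;
}

Console.WriteLine("A𠈓C".Length);            // 4 code units
Console.WriteLine(CountCodePoints("A𠈓C"));  // 3 codepoints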

In short, the length of a string is actually a ridiculously complex question and calculating it can take a lot of CPU time as well as data tables.

Moreover, what's the point? Why do these metrics matter? Well, only you can answer that for your case, but personally, I find they are generally irrelevant. Limiting data entry is more logically done by byte limits, as that's what needs to be transferred or stored anyway. Limiting display size is better done by the display-side software: if you have 100 pixels for the message, how many characters you can fit depends on the font, etc., which the data-layer software doesn't know anyway. Finally, given the complexity of the Unicode standard, you're probably going to have bugs at the edge cases anyway if you try anything else.
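For example, bytes and code units are directly measurable while "characters" is not; a small sketch of the byte-limit case (Encoding.UTF8.GetByteCount is the standard .NET call):

using System.Text;

Console.WriteLine("A𠈓C".Length);                       // 4 UTF-16 code units
Console.WriteLine(Encoding.UTF8.GetByteCount("A𠈓C"));  // 6 bytes to store or transfer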

So it is a hard question with not a lot of general-purpose use. The number of code units is trivial to calculate (it is just the length of the underlying data array), has a simple definition, and is the most meaningful/useful as a general rule.

That's why b has length 4, beyond the surface explanation of "because the documentation says so".


From the documentation of the String.Length property:

The Length property returns the number of Char objects in this instance, not the number of Unicode characters. The reason is that a Unicode character might be represented by more than one Char. Use the System.Globalization.StringInfo class to work with each Unicode character instead of each Char.


The character at index 1 in "A𠈓C" is a surrogate pair.

The key point to remember is that a surrogate pair uses two 16-bit Char values to represent a single character (one whose codepoint doesn't fit in 16 bits).

You can try this code, and it will print True:

Console.WriteLine(char.IsSurrogatePair("A𠈓C", 1));

From the documentation of the Char.IsSurrogatePair(String, Int32) method:

true if the s parameter includes adjacent characters at positions index and index + 1, and the numeric value of the character at position index ranges from U+D800 through U+DBFF, and the numeric value of the character at position index+1 ranges from U+DC00 through U+DFFF; otherwise, false.

This is further explained in String.Length property:

The Length property returns the number of Char objects in this instance, not the number of Unicode characters. The reason is that a Unicode character might be represented by more than one Char. Use the System.Globalization.StringInfo class to work with each Unicode character instead of each Char.


As the other answers have pointed out, even though there are 3 visible characters, they are represented by 4 Char objects, which is why the Length is 4 and not 3.

MSDN states that

The Length property returns the number of Char objects in this instance, not the number of Unicode characters.

However, if what you really want to know is the number of "text elements" and not the number of Char objects, you can use the StringInfo class.

using System.Globalization;

var si = new StringInfo("A𠈓C");
Console.WriteLine(si.LengthInTextElements); // 3

You can also enumerate each text element like this:

var enumerator = StringInfo.GetTextElementEnumerator("A𠈓C");
while (enumerator.MoveNext())
{
    Console.WriteLine(enumerator.Current);
}

Using foreach on the string will split the middle "letter" into two char objects, and the printed result won't correspond to the string, as the sketch below shows.
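A minimal sketch of that behavior (the U+ values in the comment are what the loop actually yields for this string):

foreach (char c in "A𠈓C")
{
    Console.WriteLine($"{c} (U+{(int)c:X4})");
}
// Four lines: A (U+0041), then the two surrogate halves U+D840 and
// U+DE13 (neither renders as 𠈓 on its own), and finally C (U+0043)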


That is because the Length property returns the number of Char objects, not the number of Unicode characters. In your case, one of the Unicode characters is represented by more than one Char object (a surrogate pair).

The Length property returns the number of Char objects in this instance, not the number of Unicode characters. The reason is that a Unicode character might be represented by more than one Char. Use the System.Globalization.StringInfo class to work with each Unicode character instead of each Char.


As others have said, it's not the number of characters in the string but the number of Char objects. The character 𠈓 is codepoint U+20213. Since that value is outside the 16-bit char type's range, it's encoded in UTF-16 as the surrogate pair D840 DE13.
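You can verify those two code unit values yourself; a quick sketch using char.ConvertFromUtf32, which builds the UTF-16 form of a codepoint:

string s = char.ConvertFromUtf32(0x20213);      // "𠈓" as a string
Console.WriteLine(s.Length);                    // 2 Char objects
Console.WriteLine(((int)s[0]).ToString("X4"));  // D840 (high surrogate)
Console.WriteLine(((int)s[1]).ToString("X4"));  // DE13 (low surrogate)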

The way to get the length in characters was mentioned in the other answers. However, it should be used with care, as there can be many ways to represent a character in Unicode: "à" may be one composed character or two characters (a + a combining diacritic). Normalization may be needed, as in the case of Twitter.
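As a sketch of that composed/decomposed pitfall (the escape sequences here are my own illustration; Normalize and NormalizationForm live in System.Text):

using System.Text;

string composed = "\u00E0";     // "à" as a single codepoint
string decomposed = "a\u0300";  // "a" plus a combining grave accent
Console.WriteLine(composed.Length);         // 1
Console.WriteLine(decomposed.Length);       // 2
Console.WriteLine(composed == decomposed);  // False
Console.WriteLine(composed == decomposed.Normalize(NormalizationForm.FormC));  // True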

You should read this: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)