What does the .NET String.Length property return? Surrogate neutral length or complete character length

The wording of the documentation varies between VS 2008 and VS 2010:

VS 2008 Documentation

Internally, the text is stored as a readonly collection of Char objects, each of which represents one Unicode character encoded in UTF-16. ... The length of a string represents the number of characters regardless of whether the characters are formed from Unicode surrogate pairs or not. To access the individual Unicode code points in a string, use the StringInfo object. - http://msdn.microsoft.com/en-us/library/ms228362%28v=vs.90%29.aspx

VS 2010 Documentation

Internally, the text is stored as a sequential read-only collection of Char objects. ... The Length property of a string represents the number of Char objects it contains, not the number of Unicode characters. To access the individual Unicode code points in a string, use the StringInfo object. - http://msdn.microsoft.com/en-us/library/ms228362%28v=VS.100%29.aspx

The language used in both cases doesn't clearly differentiate between "character", "Unicode character", "Char class", "Unicode surrogate pair", and "Unicode code point".

The language in the VS2008 documentation stating that a "string represents the number of characters regardless of whether the characters are formed from Unicode surrogate pairs or not" seems to be defining "character" as an object that may be the result of a Unicode surrogate pair, which suggests that it may represent a 4-byte sequence rather than a 2-byte sequence. It also specifically states at the beginning that a "char" object is encoded in UTF-16, which suggests that it could represent a surrogate pair (being 4 bytes instead of 2). I'm fairly certain that is wrong, though.

The VS2010 documentation is a little more precise. It draws a distinction between "char" and "Unicode character", but not between "Unicode character" and "Unicode code point". If a code point refers to half a surrogate pair, and a "Unicode character" represents a full pair, then the "Char" class is named incorrectly, and does not refer to a "Unicode character" at all (which they state it does not), and it's really a Unicode code point.

So are both of the following statements true? (Yes, I think.)

String.Length represents the Unicode code-point length, and
String.Length represents neither the Unicode character length nor what we would consider to be a true character length (number of characters that would be displayed), but rather the number of "Char" objects, which each represent a Unicode code point (not a Unicode character).

asked Apr 13 '11 22:04

Triynko

1 Answer

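String.Length does not account for surrogate pairs: it returns the number of Char objects (UTF-16 code units) in the string, so a surrogate pair counts as 2. The StringInfo.LengthInTextElements property, however, does account for them.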
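The difference can be checked with a character outside the Basic Multilingual Plane. The following is a minimal sketch, not taken from the documentation quoted above; the class and method names (`LengthDemo`, `CharCount`, `TextElementCount`) are made up for illustration, and U+1D11E is just one convenient example of a character that needs a surrogate pair.

```csharp
using System;
using System.Globalization;

static class LengthDemo
{
    // Number of Char objects (UTF-16 code units) in the string.
    public static int CharCount(string s)
    {
        return s.Length;
    }

    // Number of text elements as reported by StringInfo.
    public static int TextElementCount(string s)
    {
        return new StringInfo(s).LengthInTextElements;
    }

    static void Main()
    {
        // U+1D11E MUSICAL SYMBOL G CLEF is encoded as a surrogate pair in UTF-16.
        string clef = "\U0001D11E";

        Console.WriteLine(CharCount(clef));               // 2
        Console.WriteLine(char.IsHighSurrogate(clef[0])); // True
        Console.WriteLine(char.IsLowSurrogate(clef[1]));  // True
        Console.WriteLine(TextElementCount(clef));        // 1
    }
}
```

Each element of the string is one half of the pair, which is why indexing with `clef[0]` yields a high surrogate rather than a complete character.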

StringInfo.SubstringByTextElements is similar to String.Substring, but it operates on "text elements", such as surrogate pairs and combining character sequences, as well as ordinary characters. Both LengthInTextElements and SubstringByTextElements are based on the StringInfo.ParseCombiningCharacters method, which extracts the starting index of each text element; StringInfo stores those indexes in a private array.
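As a rough illustration of how the two kinds of substring differ, the sketch below slices the same string by Char index and by text element. The class and helper names (`SubstringDemo`, `TextElementStarts`, `MiddleElement`) are hypothetical; only the `StringInfo` and `String` members they call come from the framework.

```csharp
using System;
using System.Globalization;

static class SubstringDemo
{
    // Starting Char index of each text element, as computed by the framework.
    public static int[] TextElementStarts(string s)
    {
        return StringInfo.ParseCombiningCharacters(s);
    }

    // The second text element (index 1), taken as a whole unit.
    public static string MiddleElement(string s)
    {
        return new StringInfo(s).SubstringByTextElements(1, 1);
    }

    static void Main()
    {
        // 'a', then U+1D11E (a surrogate pair), then 'b': four Char objects in total.
        string s = "a\U0001D11Eb";

        Console.WriteLine(s.Length);                                // 4
        Console.WriteLine(string.Join(", ", TextElementStarts(s))); // 0, 1, 3

        // Slicing by text element keeps the surrogate pair intact.
        Console.WriteLine(MiddleElement(s).Length);                 // 2

        // Slicing by Char index can cut the pair in half.
        string broken = s.Substring(1, 1);
        Console.WriteLine(char.IsHighSurrogate(broken[0]));         // True
    }
}
```

The lone high surrogate returned by `Substring(1, 1)` is not a valid character on its own, which is the practical reason to prefer the text-element methods when a string may contain characters outside the BMP.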

"The .NET Framework defines a text element as a unit of text that is displayed as a single character, that is, a grapheme. A text element can be a base character, a surrogate pair, or a combining character sequence." - http://msdn.microsoft.com/en-us/library/system.globalization.stringinfo.aspx
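To relate this definition back to the terms in the question, the sketch below counts the same string three ways: Char objects, Unicode code points, and text elements. It is an illustration under stated assumptions rather than anything the documentation prescribes: `TextCounts` and its methods are hypothetical names, and the code-point count is a hand-written loop over `char.IsSurrogatePair`, not a built-in property.

```csharp
using System;
using System.Globalization;

static class TextCounts
{
    // Counts Unicode code points by treating each valid surrogate pair as one.
    public static int CodePointCount(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsSurrogatePair(s, i))
            {
                i++; // skip the low surrogate of the pair
            }
            count++;
        }
        return count;
    }

    // Counts text elements (base characters, surrogate pairs, combining sequences).
    public static int TextElementCount(string s)
    {
        return new StringInfo(s).LengthInTextElements;
    }

    static void Main()
    {
        // 'e' + U+0301 COMBINING ACUTE ACCENT (a combining character sequence),
        // followed by U+1D11E (a surrogate pair).
        string s = "e\u0301\U0001D11E";

        Console.WriteLine(s.Length);            // 4 Char objects
        Console.WriteLine(CodePointCount(s));   // 3 code points
        Console.WriteLine(TextElementCount(s)); // 2 text elements
    }
}
```

The three numbers differ, which suggests that a Char is best thought of as a UTF-16 code unit: it is neither a full code point nor a displayed character.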

answered Sep 19 '22 09:09

Triynko
