 

Length() vs Sizeof() on Unicode strings

Quoting the Delphi XE8 help:

For single-byte and multibyte strings, Length returns the number of bytes used by the string. Example for UTF-8:

   Writeln(Length(Utf8String('1¢'))); // displays 3

For Unicode (WideString) strings, Length returns the number of bytes divided by two.

This raises important questions:

  1. Why is there a difference in handling at all?
  2. Why doesn't Length() do what it's expected to do and return just the length of the parameter (as in, the count of elements), instead of giving the size in bytes in some cases?
  3. Why does it state that it divides the result by 2 for Unicode (UTF-16) strings? AFAIK a UTF-16 code point can take up to 4 bytes, so this would give incorrect results.
asked Jun 03 '15 by ZzZombo



1 Answer

Length returns the number of elements when considering the string as an array.

  • For strings with 8 bit element types (ANSI, UTF-8) then Length gives you the number of bytes since the number of bytes is the same as the number of elements.
  • For strings with 16 bit elements (UTF-16) then Length is half the number of bytes because each element is 2 bytes wide.

Your string '1¢' has two code points, but the second code point requires two bytes to encode it in UTF-8. Hence Length(Utf8String('1¢')) evaluates to three.
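
As an illustration, here is a minimal console sketch (assuming Delphi 2009 or later, where UnicodeString is the default string type); the same two code points give different element counts depending on the string type's encoding:

program LengthElements;
{$APPTYPE CONSOLE}
var
  U8: UTF8String;
  U16: UnicodeString;
begin
  U8 := UTF8String('1¢');
  U16 := '1¢';
  Writeln(Length(U8));  // 3: '1' is one byte, '¢' takes two bytes in UTF-8
  Writeln(Length(U16)); // 2: each code point fits in a single UTF-16 element
end.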

You mention SizeOf in the question title. Passing a string variable to SizeOf will always return the size of a pointer, since a string variable is, under the hood, just a pointer.
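
A short sketch of the distinction (program and variable names are illustrative):

program SizeOfVsLength;
{$APPTYPE CONSOLE}
var
  S: string;
begin
  S := 'Hello';
  Writeln(SizeOf(S)); // 4 on 32-bit, 8 on 64-bit: the size of the reference itself
  Writeln(Length(S)); // 5: the number of character elements in the payload
end.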

To your specific questions:

Why is there a difference in handling at all?

There is only a difference if you think of Length as relating to bytes. But that's the wrong way to think about it. Length always returns an element count, and when viewed that way, its behaviour is uniform across all string types, and indeed across all array types.

Why doesn't Length() do what it's expected to do and return just the length of the parameter (as in, the count of elements), instead of giving the size in bytes in some cases?

It does always return the element count. It just so happens that when the element size is a single byte, the element count and the byte count are the same. In fact, the documentation that you refer to also contains the following, just above the excerpt that you provided: "Returns the number of characters in a string or of elements in an array." That is the key text. The excerpt that you included is meant as an illustration of the implications of that sentence.

Why does it state that it divides the result by 2 for Unicode (UTF-16) strings? AFAIK a UTF-16 code point can take up to 4 bytes, so this would give incorrect results.

UTF-16 character elements are always 16 bits wide. However, some Unicode code points require two character elements to encode. These pairs of character elements are known as surrogate pairs.
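
For example (a sketch; U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic Multilingual Plane, so UTF-16 stores it as the surrogate pair $D834 $DD1E):

program SurrogatePairDemo;
{$APPTYPE CONSOLE}
var
  S: UnicodeString;
begin
  S := #$D834#$DD1E;  // one code point, encoded as two UTF-16 character elements
  Writeln(Length(S)); // 2: Length counts elements, not code points
end.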


You are hoping, I think, that Length will return the number of code points in a string. But it doesn't. It returns the number of character elements. And for variable-length encodings, the number of code points is not necessarily the same as the number of character elements. If your string were encoded as UTF-32 then the number of code points would be the same as the number of character elements, since UTF-32 is a constant-sized encoding.

A quick way to count the code points is to scan through the string checking for surrogates: count one code point for each surrogate pair, and one for each character element that is not part of a surrogate pair. In pseudo-code:

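// Each code point contributes exactly 2 to N: a surrogate pair adds 1 for
// each of its two elements, while a non-surrogate element adds 2 on its own.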
N := 0;
for C in S do
  if C.IsSurrogate then
    inc(N)
  else
    inc(N, 2);
CodePointCount := N div 2;
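
A runnable version of the same idea might look like this (a sketch assuming Delphi XE4 or later, where the Char record helper in System.Character provides IsSurrogate; add System.Character to the uses clause):

function CodePointCount(const S: string): Integer;
var
  C: Char;
  N: Integer;
begin
  N := 0;
  for C in S do
    if C.IsSurrogate then
      Inc(N)      // each half of a surrogate pair adds 1, so a pair adds 2 in total
    else
      Inc(N, 2);  // a non-surrogate element is a whole code point, so it adds 2 on its own
  Result := N div 2;
end;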

Another point to make is that the code point count is not the same as the visible character count. Some code points are combining characters and are combined with their neighbouring code points to form a single visible character or glyph.
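
For instance (a sketch; U+0301 is COMBINING ACUTE ACCENT):

program CombiningDemo;
{$APPTYPE CONSOLE}
var
  S: UnicodeString;
begin
  S := 'e'#$0301;     // two code points: 'e' followed by a combining acute accent
  Writeln(Length(S)); // 2 elements and 2 code points, but rendered as the single glyph 'é'
end.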

Finally, if all you are hoping to do is find the byte size of the string payload, use this expression:

Length(S) * SizeOf(S[1])

This expression works for all types of string.
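
For example (a sketch; the ANSI result assumes a codepage such as Windows-1252 in which '¢' maps to a single byte):

program PayloadBytes;
{$APPTYPE CONSOLE}
var
  A: AnsiString;
  U8: UTF8String;
  U16: UnicodeString;
begin
  A := AnsiString('1¢');
  U8 := UTF8String('1¢');
  U16 := '1¢';
  Writeln(Length(A) * SizeOf(A[1]));     // 2 bytes: one-byte elements
  Writeln(Length(U8) * SizeOf(U8[1]));   // 3 bytes: '¢' needs two UTF-8 bytes
  Writeln(Length(U16) * SizeOf(U16[1])); // 4 bytes: two-byte elements
end.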

Be very careful about the function System.SysUtils.ByteLength. On the face of it this seems to be just what you want. However, that function returns the byte length of a UTF-16 encoded string. So if you pass it an AnsiString, say, then the value returned by ByteLength is twice the number of bytes of the AnsiString.
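
A sketch of the pitfall (the compiler will warn about the implicit AnsiString-to-string conversion):

program ByteLengthPitfall;
{$APPTYPE CONSOLE}
uses
  System.SysUtils;
var
  A: AnsiString;
begin
  A := AnsiString('Hello');
  Writeln(ByteLength(A));            // 10: A is first converted to UTF-16, 5 elements * 2 bytes
  Writeln(Length(A) * SizeOf(A[1])); // 5: the actual byte size of the AnsiString payload
end.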

answered Oct 30 '22 by David Heffernan