Storing UTF-8 string in a UnicodeString

Tags: string, unicode, utf-8, utf-16, delphi

In Delphi 2007 you can store a UTF-8 string in a WideString and then pass that onto a Win32 function, e.g.

var
  UnicodeStr: WideString;
  UTF8Str: WideString;
begin
  UnicodeStr:='some unicode text';
  UTF8Str:=UTF8Encode(UnicodeStr);
  Windows.SomeFunction(PWideChar(UTF8Str), ...)
end;

Delphi 2007 does not interfere with the contents of UTF8Str, i.e. it is left as a UTF-8 encoded string stored in a WideString.

But in Delphi 2010 I'm struggling to find a way to do the same thing, i.e. store a UTF-8 encoded string in a WideString without it being automatically converted from UTF-8. I cannot pass a pointer to a UTF-8 string (or RawByteString), e.g. the following will obviously not work:

var
  UnicodeStr: WideString;
  UTF8Str: UTF8String;
begin
  UnicodeStr:='some unicode text';
  UTF8Str:=UTF8Encode(UnicodeStr);
  Windows.SomeFunction(PWideChar(UTF8Str), ...)
end;

asked Apr 23 '10 10:04

Mick

2 Answers

Your original Delphi 2007 code was converting the UTF-8 string to a widestring using the ANSI codepage. To do the same thing in Delphi 2010 you should use SetCodePage with the Convert parameter false.

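var
  UnicodeStr: UnicodeString;
  UTF8Str: RawByteString;
begin
  UTF8Str := UTF8Encode('some unicode text');
  // Retag the bytes as ANSI (code page 0 = CP_ACP) without converting them
  SetCodePage(UTF8Str, 0, False);
  // The implicit conversion now widens each byte via the ANSI code page
  UnicodeStr := UTF8Str;
  Windows.SomeFunction(PWideChar(UnicodeStr), ...)
end;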
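
To make the effect of that reinterpretation visible, here is a small console sketch built on the same SetCodePage call. The program and variable names are made up for illustration, and the second length in the comments assumes a single-byte ANSI code page such as Windows-1252; on other system code pages the result may differ.

```delphi
program MisreadUtf8Demo;
{$APPTYPE CONSOLE}

var
  Original, Misread: UnicodeString;
  Raw: RawByteString;
begin
  Original := #$00FC;              // 'ü', a single UTF-16 code unit
  Raw := UTF8Encode(Original);     // two bytes ($C3 $BC)
  SetCodePage(Raw, 0, False);      // retag as CP_ACP; the bytes are left untouched
  Misread := UnicodeString(Raw);   // each byte is decoded through the ANSI code page
  Writeln(Length(Original));       // 1
  Writeln(Length(Misread));        // 2 is expected on a single-byte ANSI code page
end.
```

The point is that Misread no longer holds 'ü' but one wide character per UTF-8 byte, which appears to be what the Delphi 2007 code was passing to the API.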

answered Oct 21 '22 16:10

Zoë Peterson

Hmm, why are you doing that? Why encode a WideString to UTF-8 only to store the result back in a WideString? You are obviously calling the Unicode version of the Windows API, so there is no need for a UTF-8-encoded string. Or am I missing something?

Windows API functions are either Unicode (UTF-16, two-byte code units) or ANSI (one-byte code units). UTF-8 would be the wrong choice here: ASCII characters take one byte, but every character beyond ASCII takes two or more.

Otherwise the equivalent of your old code in Unicode Delphi would be:

var
  UnicodeStr: string;
  UTF8Str: string;
begin
  UnicodeStr:='some unicode text';
  UTF8Str:=UTF8Encode(UnicodeStr);
  Windows.SomeFunction(PWideChar(UTF8Str), ...)
end;

WideString and string (UnicodeString) are similar, but the new UnicodeString is faster because it is reference-counted, whereas WideString wraps a COM BSTR and is not.

Your code was not correct because a UTF-8 string uses a variable number of bytes per character. "A" is stored as one byte, just its ASCII code. "ü", on the other hand, is stored as two bytes. And because you then cast to PWideChar, the function always expects two bytes per character.
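
A quick way to check the byte counts above is to measure the length of the encoded string. This is only a sketch for a Unicode Delphi version (2009 or later); Utf8ByteCount is a hypothetical helper name, not an RTL function.

```delphi
program Utf8ByteCountDemo;
{$APPTYPE CONSOLE}

// Hypothetical helper: number of bytes S occupies once encoded as UTF-8.
function Utf8ByteCount(const S: UnicodeString): Integer;
begin
  // Length of a RawByteString counts bytes, not characters
  Result := Length(UTF8Encode(S));
end;

begin
  Writeln(Utf8ByteCount('A'));      // 1
  Writeln(Utf8ByteCount(#$00FC));   // 2 ('ü')
end.
```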

There is another difference. In older (ANSI) Delphi versions UTF8String was just an AnsiString. In Unicode versions of Delphi UTF8String is an AnsiString tagged with the UTF-8 code page (65001), so the compiler converts it automatically on assignment and it behaves differently.
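
The code page tag and the automatic conversion can be observed directly. The following is a minimal sketch for a Unicode Delphi version; the program and variable names are illustrative only.

```delphi
program Utf8CodePageDemo;
{$APPTYPE CONSOLE}

var
  Original, RoundTrip: UnicodeString;
  U: UTF8String;
begin
  Original := 'gr'#$00FC'n';       // 'grün'
  U := UTF8Encode(Original);
  Writeln(StringCodePage(U));      // 65001, i.e. CP_UTF8
  RoundTrip := UnicodeString(U);   // compiler-generated UTF-8 to UTF-16 conversion
  Writeln(RoundTrip = Original);   // TRUE
end.
```

This automatic decoding is why a UTF8String can no longer be smuggled into a wide string byte for byte without first changing or removing its code page tag.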

The old code would still work correctly:

var
  UnicodeStr: WideString;
  UTF8Str: WideString;
begin
  UnicodeStr:='some unicode text';
  UTF8Str:=UTF8Encode(UnicodeStr);
  Windows.SomeFunction(PWideChar(UTF8Str), ...)
end;

It would act the same as it did in Delphi 2007. So maybe you have a problem elsewhere.

Mick, you are correct. The compiler does some extra work behind the scenes. To avoid this, you can do something like the following:

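var
  UTF8Str: AnsiString;
  UnicodeStr: WideString;
  TempString: RawByteString;
  ResultString: WideString;
begin
  UnicodeStr := 'some unicode text';
  TempString := UTF8Encode(UnicodeStr);
  // Copy the raw bytes into a plain AnsiString so they lose the UTF-8 tag
  SetLength(UTF8Str, Length(TempString));
  if Length(UTF8Str) > 0 then // indexing [1] on an empty string is invalid
    Move(TempString[1], UTF8Str[1], Length(UTF8Str));
  // Widened through the ANSI code page, not decoded as UTF-8
  ResultString := UTF8Str;
end;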

I checked, and it works just the same. Because I move the bytes directly in memory, no code page conversion is done in the background. I am sure it can be done more elegantly, but this is the approach I would take for what you want to achieve.

answered Oct 21 '22 14:10

Runner
