
Does a cast from RawByteString to string automatically invoke UTF8Decode?

I want to store arbitrary binary data as a BLOB in a SQLite database.

The data is added as a value with this function:

procedure TSQLiteDatabase.AddParamText(name: string; value: string);

Now I want to convert a WideString into its UTF-8 representation so that it can be stored in the database. After calling UTF8Encode and storing the result in the database, I noticed that the data inside the database is not UTF-8 encoded. Instead, it is encoded as an AnsiString in my computer's locale.

I ran the following test to check what happened:

type
  {$IFDEF Unicode}
  TBinary = RawByteString;
  {$ELSE}
  TBinary = AnsiString;
  {$ENDIF}

procedure TForm1.Button1Click(Sender: TObject);
var
  original: WideString;
  blob: TBinary;
begin
  original := 'ä';
  blob     := UTF8Encode(original);

  // Delphi 6:   Ã¤ (as expected)
  // Delphi XE4: ä (unexpected! How did it do an automatic UTF8Decode???)
  ShowMessage(blob);
end;

After the character "ä" has been converted to UTF-8, the data is correct in memory (the two UTF-8 bytes, which show up as "Ã¤" when viewed as ANSI). However, as soon as I pass the TBinary value to a function (as string or AnsiString), Delphi XE4 does a "magic typecast" that invokes UTF8Decode, for a reason I do not know.

I have already found a workaround to avoid this:

function RealUTF8Encode(AInput: WideString): TBinary;
var
  tmp: TBinary;
begin
  tmp := UTF8Encode(AInput);
  SetLength(result, Length(tmp));
  CopyMemory(@result[1], @tmp[1], Length(tmp));
end;

procedure TForm1.Button2Click(Sender: TObject);
var
  original: WideString;
  blob: TBinary;
begin
  original := 'ä';
  blob     := RealUTF8Encode(original);

  // Delphi 6:   Ã¤ (as expected)
  // Delphi XE4: Ã¤ (as expected)
  ShowMessage(blob);
end;

However, this workaround with RealUTF8Encode looks dirty to me. I would like to understand why a simple call to UTF8Encode did not work and whether there is a better solution.

Daniel Marschall asked Jun 05 '14 10:06



1 Answer

In Ansi versions of Delphi (prior to D2009), UTF8Encode() returns a UTF-8 encoded AnsiString. In Unicode versions (D2009 and later), it returns a UTF-8 encoded RawByteString with a code page of CP_UTF8 (65001) assigned to it.

In Ansi versions, ShowMessage() takes an AnsiString as input, and the UTF-8 string is an AnsiString, so it gets displayed as-is. In Unicode versions, ShowMessage() takes a UTF-16 encoded UnicodeString as input, so the UTF-8 encoded RawByteString gets converted to UTF-16 using its assigned CP_UTF8 code page.
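To make the implicit conversion visible, here is a minimal console sketch for a Unicode version of Delphi (D2009 or later). The program name and variable names are made up for illustration. It uses StringCodePage() from the System unit to read the code page tag that travels with the string payload, and #$00E4 (the code point of "ä") instead of a literal so that the source file encoding does not matter.

```pascal
program CodePageDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  original: WideString;
  blob: RawByteString;
  asText: string;
begin
  original := #$00E4;              // "ä" as a single UTF-16 code unit
  blob := UTF8Encode(original);

  // The payload holds the UTF-8 bytes, and the string header records
  // which code page those bytes are in.
  Writeln('Code page tag: ', StringCodePage(blob));
  Writeln('Length in bytes: ', Length(blob));

  // Converting to string (UnicodeString) consults that tag, so the
  // UTF-8 bytes are decoded to UTF-16 rather than copied byte for byte.
  asText := string(blob);
  Writeln('Length in UTF-16 code units: ', Length(asText));
end.
```

If the answer's description holds, the code page tag printed should be 65001 (CP_UTF8), which is why ShowMessage() ends up receiving a correctly decoded "ä".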

If you actually wrote the blob data directly to the database you would find that it may or may not be UTF-8 encoded, depending on how you are writing it. But your approach is wrong; the use of RawByteString is incorrect in this situation. RawByteString is meant to be used as a procedure parameter only. Do not use it as a local variable. That is the source of your problem. From the documentation:

The purpose of RawByteString is to reduce the need for multiple overloads of procedures that read string data. This means that parameters of routines that process strings without regard for the string's code page should typically be of type RawByteString.

RawByteString should only be used as a parameter type, and only in routines which otherwise would need multiple overloads for AnsiStrings with different codepages. Such routines need to be written with care for the actual codepage of the string at run time.
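As an illustration of the parameter-only usage the documentation describes, here is a small sketch for a Unicode version of Delphi. HexDump is a hypothetical helper, not an RTL routine. It accepts any AnsiString flavour through a RawByteString parameter and looks only at the raw bytes, so no code page conversion should take place at the call site.

```pascal
program RawParamDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: processes the bytes without regard for the code page.
function HexDump(const S: RawByteString): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(S) do
    Result := Result + IntToHex(Ord(S[i]), 2);
end;

var
  text: string;
  asUtf8: UTF8String;
  asAnsi: AnsiString;
begin
  text := #$00E4;               // "ä"
  asUtf8 := UTF8String(text);   // converted to UTF-8 by the cast
  asAnsi := AnsiString(text);   // converted to the system ANSI code page

  Writeln(HexDump(asUtf8));     // the two UTF-8 bytes C3 A4
  Writeln(HexDump(asAnsi));     // depends on the system locale
end.
```

One overload serves both string types here, which is the use case RawByteString was designed for. The local variables keep their concrete types (UTF8String, AnsiString), and only the parameter is declared as RawByteString.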

For Unicode versions of Delphi, instead of RawByteString, I would suggest that you use TBytes to hold your UTF-8 data, and encode it with TEncoding:

var
  utf8: TBytes;
  str: string;
...
str := ...;
utf8 := TEncoding.UTF8.GetBytes(str);

You are looking for a data type that does not perform implicit text encodings when passed around, and TBytes is that type.
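A short round-trip sketch, assuming a Unicode version of Delphi where TEncoding is available in SysUtils. The program and variable names are illustrative. It encodes a string to TBytes and decodes it again with TEncoding.UTF8.GetString, which is the explicit counterpart you would call when reading the BLOB back.

```pascal
program BytesRoundTrip;
{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  str, back: string;
  utf8: TBytes;
begin
  str := #$00E4;                            // "ä"

  // Explicit encode: UTF-16 string to UTF-8 bytes.
  utf8 := TEncoding.UTF8.GetBytes(str);
  Writeln('Byte count: ', Length(utf8));

  // Explicit decode: nothing happens to the bytes unless you ask for it.
  back := TEncoding.UTF8.GetString(utf8);
  Writeln('Round trip matches: ', back = str);
end.
```

Because TBytes is a plain dynamic array of bytes with no code page attached, passing it to another routine cannot trigger a text conversion. Every encode or decode is a visible call.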

For Ansi versions of Delphi, you can use AnsiString, WideString and UTF8Encode exactly as you do.

Personally however, I would recommend using TBytes consistently for your UTF-8 data. So if you need a single code base that supports Ansi and Unicode compilers (ugh!) then you should create some helpers:

{$IFDEF Unicode}
function GetUTF8Bytes(const Value: string): TBytes;
begin
  Result := TEncoding.UTF8.GetBytes(Value);
end;
{$ELSE}
function GetUTF8Bytes(const Value: WideString): TBytes;
var
  utf8str: UTF8String;
begin
  utf8str := UTF8Encode(Value);
  SetLength(Result, Length(utf8str));
  Move(Pointer(utf8str)^, Pointer(Result)^, Length(utf8str));
end;
{$ENDIF}

The Ansi version incurs more heap allocations than are necessary. You might well choose to write a more efficient helper that calls WideCharToMultiByte() directly.
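A helper along those lines might look like the following sketch. It is Windows-only and untested against old compilers. The names WideToUTF8Bytes, TUtf8Bytes and UTF8_CODE_PAGE are made up for this example. The array type and the constant are declared locally on the assumption that an older RTL may not provide TBytes or CP_UTF8.

```pascal
program DirectEncodeDemo;
{$APPTYPE CONSOLE}

uses
  Windows;

type
  TUtf8Bytes = array of Byte;   // local stand-in for TBytes (assumption)

const
  UTF8_CODE_PAGE = 65001;       // same value as CP_UTF8

// Hypothetical helper: encodes straight into the result buffer,
// without an intermediate string allocation.
function WideToUTF8Bytes(const Value: WideString): TUtf8Bytes;
var
  needed: Integer;
begin
  Result := nil;
  if Value = '' then
    Exit;

  // First call: ask how many bytes the UTF-8 form needs.
  needed := WideCharToMultiByte(UTF8_CODE_PAGE, 0, PWideChar(Value),
    Length(Value), nil, 0, nil, nil);
  if needed <= 0 then
    Exit;   // conversion failed; a real helper might raise an exception

  // Second call: write the bytes directly into the array.
  SetLength(Result, needed);
  WideCharToMultiByte(UTF8_CODE_PAGE, 0, PWideChar(Value),
    Length(Value), PAnsiChar(@Result[0]), needed, nil, nil);
end;

var
  data: TUtf8Bytes;
begin
  data := WideToUTF8Bytes(#$00E4);   // "ä"
  Writeln('Byte count: ', Length(data));
end.
```

The two-call pattern (query the size, then convert) costs an extra API call but needs only a single heap allocation for the result. The last two arguments are nil because WideCharToMultiByte does not accept default-character parameters for UTF-8.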

In Unicode versions of Delphi, if for some reason you don't want to use TBytes for UTF-8 data, you can use UTF8String instead. This is a special AnsiString that always uses the CP_UTF8 code page. You can then write:

var
  utf8: UTF8String;
  str: string;
....
utf8 := str;

and the compiler will convert from UTF-16 to UTF-8 behind the scenes for you. I would not recommend this though, because it is not supported on mobile platforms, or in Ansi versions of Delphi (UTF8String has existed since Delphi 6, but it was not a true UTF-8 string until Delphi 2009). That is, amongst other reasons, why I suggest that you use TBytes. My philosophy is, at least in the Unicode age, that there is the native string type, and any other encoding should be held in TBytes.

David Heffernan answered Sep 30 '22 04:09