I want to store arbitrary binary data as a BLOB in a SQLite database.
The data will be added as a value with this function:
procedure TSQLiteDatabase.AddParamText(name: string; value: string);
Now I want to convert a WideString into its UTF-8 representation so it can be stored in the database. After calling UTF8Encode and storing the result in the database, I noticed that the data inside the database is not UTF-8 encoded. Instead, it is encoded as an AnsiString in my computer's locale.
I ran the following test to check what happened:
type
{$IFDEF Unicode}
  TBinary = RawByteString;
{$ELSE}
  TBinary = AnsiString;
{$ENDIF}

procedure TForm1.Button1Click(Sender: TObject);
var
  original: WideString;
  blob: TBinary;
begin
  original := 'ä';
  blob := UTF8Encode(original);
  // Delphi 6:   Ã¤ (as expected)
  // Delphi XE4: ä (unexpected! How did it do an automatic UTF8Decode???)
  ShowMessage(blob);
end;
After the character "ä" has been converted to UTF-8, the data is correct in memory ("Ã¤"). However, as soon as I pass the TBinary value to a function (as string or AnsiString), Delphi XE4 does a "magic typecast", invoking UTF8Decode for some reason I don't know.
I have already found a workaround to avoid this:
function RealUTF8Encode(AInput: WideString): TBinary;
var
  tmp: TBinary;
begin
  tmp := UTF8Encode(AInput);
  SetLength(result, Length(tmp));
  CopyMemory(@result[1], @tmp[1], Length(tmp));
end;

procedure TForm1.Button2Click(Sender: TObject);
var
  original: WideString;
  blob: TBinary;
begin
  original := 'ä';
  blob := RealUTF8Encode(original);
  // Delphi 6:   Ã¤ (as expected)
  // Delphi XE4: Ã¤ (as expected)
  ShowMessage(blob);
end;
However, this workaround with RealUTF8Encode looks dirty to me, and I would like to understand why a simple call to UTF8Encode did not work and whether there is a better solution.
In Ansi versions of Delphi (prior to D2009), UTF8Encode() returns a UTF-8 encoded AnsiString. In Unicode versions (D2009 and later), it returns a UTF-8 encoded RawByteString with a code page of CP_UTF8 (65001) assigned to it.
In Ansi versions, ShowMessage() takes an AnsiString as input, and the UTF-8 string is an AnsiString, so it gets displayed as-is. In Unicode versions, ShowMessage() takes a UTF-16 encoded UnicodeString as input, so the UTF-8 encoded RawByteString gets converted to UTF-16 using its assigned CP_UTF8 code page.
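One way to confirm this is to look at the bytes themselves instead of displaying the string. The following is a minimal console sketch for a Unicode version of Delphi; the program and variable names are only illustrative. It dumps the bytes held in the RawByteString and the code page recorded on it, and neither step converts the value to a UnicodeString.

```delphi
program Utf8Probe;

{$APPTYPE CONSOLE}

uses
  SysUtils; // System.SysUtils in XE2 and later if unit scope names are not configured

var
  original: WideString;
  blob: RawByteString;
  i: Integer;
begin
  // #$00E4 is 'ä'; using the code point avoids depending on the source file's encoding.
  original := #$00E4;
  blob := UTF8Encode(original);

  // Dump the raw payload byte by byte; the UTF-8 form of 'ä' is C3 A4.
  for i := 1 to Length(blob) do
    Write(IntToHex(Ord(blob[i]), 2), ' ');
  Writeln;

  // Print the code page stored in the string header (65001 would be CP_UTF8).
  Writeln(StringCodePage(blob));
end.
```

If the bytes shown are the UTF-8 sequence, the encoding step worked, and what ShowMessage() displayed is the result of the implicit conversion to UnicodeString.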
If you actually wrote the blob data directly to the database, you would find that it may or may not be UTF-8 encoded, depending on how you are writing it. But your approach is wrong; the use of RawByteString is incorrect in this situation. RawByteString is meant to be used as a procedure parameter only. Do not use it as a local variable. That is the source of your problem. From the documentation:
The purpose of RawByteString is to reduce the need for multiple overloads of procedures that read string data. This means that parameters of routines that process strings without regard for the string's code page should typically be of type RawByteString.
RawByteString should only be used as a parameter type, and only in routines which otherwise would need multiple overloads for AnsiStrings with different codepages. Such routines need to be written with care for the actual codepage of the string at run time.
For Unicode versions of Delphi, instead of RawByteString, I would suggest that you use TBytes to hold your UTF-8 data, and encode it with TEncoding:
var
  utf8: TBytes;
  str: string;
...
  str := ...;
  utf8 := TEncoding.UTF8.GetBytes(str);
You are looking for a data type that does not perform implicit text encodings when passed around, and TBytes is that type.
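As a rough illustration of that round trip, here is a small console sketch for a Unicode version of Delphi. The program and variable names are made up for the example; TEncoding.UTF8.GetBytes and TEncoding.UTF8.GetString are the only library calls it relies on.

```delphi
program Utf8RoundTrip;

{$APPTYPE CONSOLE}

uses
  SysUtils; // System.SysUtils in XE2 and later if unit scope names are not configured

var
  str, decoded: string;
  utf8: TBytes;
  i: Integer;
begin
  str := #$00E4; // 'ä', given as a code point

  // Encode explicitly: the result is plain bytes with no code page attached.
  utf8 := TEncoding.UTF8.GetBytes(str);

  // Dump the bytes; the UTF-8 form of 'ä' is C3 A4.
  for i := 0 to High(utf8) do
    Write(IntToHex(utf8[i], 2), ' ');
  Writeln;

  // Decode explicitly when text is needed again, then compare with the input.
  decoded := TEncoding.UTF8.GetString(utf8);
  Writeln(decoded = str);
end.
```

Because a TBytes value is just an array of bytes, passing it to another routine cannot trigger a string conversion; decoding only happens where GetString is called.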
For Ansi versions of Delphi, you can use AnsiString, WideString and UTF8Encode exactly as you do.
Personally, however, I would recommend using TBytes consistently for your UTF-8 data. So if you need a single code base that supports Ansi and Unicode compilers (ugh!) then you should create some helpers:
{$IFDEF Unicode}
function GetUTF8Bytes(const Value: string): TBytes;
begin
  Result := TEncoding.UTF8.GetBytes(Value);
end;
{$ELSE}
function GetUTF8Bytes(const Value: WideString): TBytes;
var
  utf8str: UTF8String;
begin
  utf8str := UTF8Encode(Value);
  SetLength(Result, Length(utf8str));
  Move(Pointer(utf8str)^, Pointer(Result)^, Length(utf8str));
end;
{$ENDIF}
The Ansi version incurs more heap allocations than are necessary. You might well choose to write a more efficient helper that calls WideCharToMultiByte() directly.
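A possible shape for such a helper is sketched below. It is Windows-only and untested against any particular compiler version; WideToUtf8Bytes and TUtf8Bytes are hypothetical names introduced for this example, and the byte array type is declared locally so the sketch does not assume that SysUtils provides TBytes.

```delphi
program WideToUtf8;

{$APPTYPE CONSOLE}

uses
  Windows, SysUtils;

type
  // Declared locally (hypothetical name) instead of relying on SysUtils.TBytes.
  TUtf8Bytes = array of Byte;

// Hypothetical helper: encodes a WideString as UTF-8 straight into a byte array.
function WideToUtf8Bytes(const Value: WideString): TUtf8Bytes;
var
  Needed: Integer;
begin
  Result := nil;
  if Value = '' then
    Exit;

  // First call: ask how many bytes the UTF-8 form needs.
  Needed := WideCharToMultiByte(CP_UTF8, 0, PWideChar(Value), Length(Value),
    nil, 0, nil, nil);
  if Needed = 0 then
    RaiseLastOSError;

  SetLength(Result, Needed);

  // Second call: convert directly into the array, with no intermediate string.
  if WideCharToMultiByte(CP_UTF8, 0, PWideChar(Value), Length(Value),
    PAnsiChar(@Result[0]), Needed, nil, nil) = 0 then
    RaiseLastOSError;
end;

var
  Input: WideString;
  Encoded: TUtf8Bytes;
  i: Integer;
begin
  Input := WideChar($00E4); // 'ä', given as a code point
  Encoded := WideToUtf8Bytes(Input);

  // Dump the bytes; the UTF-8 form of 'ä' is C3 A4.
  for i := 0 to High(Encoded) do
    Write(IntToHex(Encoded[i], 2), ' ');
  Writeln;
end.
```

The two-call pattern (size query, then conversion) means the only heap allocation is the result array itself, which is the saving over going through an intermediate UTF8String.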
In Unicode versions of Delphi, if for some reason you don't want to use TBytes for UTF-8 data, you can use UTF8String instead. This is a special AnsiString that always uses the CP_UTF8 code page. You can then write:
var
  utf8: UTF8String;
  str: string;
....
  utf8 := str;
and the compiler will convert from UTF-16 to UTF-8 behind the scenes for you. I would not recommend this though, because it is not supported on mobile platforms, or in Ansi versions of Delphi (UTF8String has existed since Delphi 6, but it was not a true UTF-8 string until Delphi 2009). That is, amongst other reasons, why I suggest that you use TBytes. My philosophy is, at least in the Unicode age, that there is the native string type, and any other encoding should be held in TBytes.