Take the following situation:
procedure Test;
var
  Response: String;
begin
  Response := IdHttp.Post(MyUrL, AStream);
  DoSomethingWith(Response);
end;
The web server returns data in UTF-8. Suppose it returns some UTF-8 XML containing the character é. When I use the variable Response, it does not contain this character but its UTF-8 encoded form (#C3#A9). So did Indy not decode it?
Now I know how to solve this problem:
procedure Test;
var
  Response: String;
begin
  Response := UTF8ToString(IdHttp.Post(MyUrL, AStream));
  DoSomethingWith(Response);
end;
One caveat with this solution: Delphi raises warning W1058 (Implicit string cast with potential data loss from 'string' to 'RawByteString').
My question: is this the correct way to deal with this problem, or can I instruct TIdHTTP to do the conversion to UnicodeString for me?
If you are using an up-to-date version of Indy 10, then the overloaded version of TIdHTTP.Post() that returns a String does decode the data to Unicode. However, the actual charset used for the decoding depends on what media type the HTTP Content-Type response header specifies:
1. if the media type is application/xml, application/xml-external-parsed-entity, or application/xml-dtd, or is not a text/... type but does end with +xml, then the charset specified in the encoding attribute of the XML's prolog is used. If no charset is specified, UTF-8 is used.
2. otherwise, if the Content-Type response header specifies a charset, then it is used.
3. otherwise, if the media type is a text/... type, then:
   a. if the media type is text/xml, text/xml-external-parsed-entity, or ends with +xml, then us-ascii is used.
   b. otherwise ISO-8859-1 is used.
4. otherwise, Indy's default encoding (ASCII by default) is used.
Without seeing the actual HTTP Content-Type header, it is hard to know which condition your situation falls into. It sounds like it is falling into either #2 or #3b, which would account for the UTF-8 byte values being returned as-is, if ISO-8859-1 or a similar charset is being used.
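To find out which of the conditions above applies, it may help to look at what the server actually declared. The following is only a minimal sketch: DescribeResponseCharset and the unit name are hypothetical, and it assumes AHttp has already completed a request. The unit-scoped name System.SysUtils assumes Delphi XE2 or later; on older versions use SysUtils instead.

```pascal
unit ResponseCharsetSketch; // hypothetical unit name

interface

uses
  System.SysUtils, IdHTTP;

// Hypothetical helper: summarises what the server declared for the last response.
function DescribeResponseCharset(AHttp: TIdHTTP): string;

implementation

function DescribeResponseCharset(AHttp: TIdHTTP): string;
begin
  Result := Format('Raw header: "%s", media type: "%s", charset: "%s"',
    [AHttp.Response.RawHeaders.Values['Content-Type'], // header as sent by the server
     AHttp.Response.ContentType,                       // parsed media type
     AHttp.Response.CharSet]);                         // parsed charset, empty if none was sent
end;

end.
```

An empty charset combined with a text/... media type would point at case #3, while a charset other than UTF-8 would point at case #2.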
UTF8ToString() expects a UTF-8 encoded RawByteString as input, but you are passing it a UTF-16 encoded UnicodeString instead. The RTL will perform a UTF16->Ansi conversion in that situation, using a default Ansi charset for the conversion. That is why you get the compiler warning, because such a conversion can lose data.
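If you know for certain that the server sends UTF-8, one way to avoid the UTF16->Ansi round trip is to receive the raw bytes and decode them yourself. This is a sketch under that assumption, not something Indy does for you: PostAndDecodeUtf8 and the unit name are hypothetical, and the unit-scoped uses names assume Delphi XE2 or later.

```pascal
unit PostUtf8Sketch; // hypothetical unit name

interface

uses
  System.Classes, System.SysUtils, IdHTTP;

// Hypothetical helper: posts ARequest and decodes the response body as UTF-8.
function PostAndDecodeUtf8(AHttp: TIdHTTP; const AUrl: string;
  ARequest: TStream): string;

implementation

function PostAndDecodeUtf8(AHttp: TIdHTTP; const AUrl: string;
  ARequest: TStream): string;
var
  Raw: TBytesStream;
begin
  Raw := TBytesStream.Create;
  try
    // The stream overload of Post() stores the body bytes without decoding them
    AHttp.Post(AUrl, ARequest, Raw);
    if Raw.Size = 0 then
      Result := ''
    else
      // Bytes can be longer than Size (spare capacity), so pass the length explicitly
      Result := TEncoding.UTF8.GetString(Raw.Bytes, 0, Integer(Raw.Size));
  finally
    Raw.Free;
  end;
end;

end.
```

Because the bytes never pass through a UnicodeString before being decoded, no implicit string cast takes place and W1058 should not be raised.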
XML is really a binary data format, subject to charset encodings. An XML parser needs to know what the XML's encoding is, and be able to parse the raw encoded bytes accordingly. That is why XML has an explicit encoding attribute right in the XML prolog. However, when TIdHTTP downloads XML as a String, although it does automatically decode it to Unicode, it does not yet update the XML's prolog accordingly.
The real solution is to not download XML as a String in the first place. Download it as a TStream instead (TMemoryStream is a better choice than TStringStream) so your XML parser has access to the original bytes, the original charset declaration, etc. You can pass the TStream to the TXMLDocument.LoadFromStream() method, for instance.
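The stream-based approach could look roughly like this. It is a sketch, not a definitive implementation: PostAndParseXml and the unit name are hypothetical, and the unit-scoped names (System.Classes, Xml.XMLDoc, Xml.XMLIntf) assume Delphi XE2 or later.

```pascal
unit PostXmlSketch; // hypothetical unit name

interface

uses
  System.Classes, IdHTTP, Xml.XMLIntf, Xml.XMLDoc;

// Hypothetical helper: posts ARequest and parses the raw response bytes as XML.
function PostAndParseXml(AHttp: TIdHTTP; const AUrl: string;
  ARequest: TStream): IXMLDocument;

implementation

function PostAndParseXml(AHttp: TIdHTTP; const AUrl: string;
  ARequest: TStream): IXMLDocument;
var
  Raw: TMemoryStream;
begin
  Raw := TMemoryStream.Create;
  try
    // Receive the body as raw bytes; no charset decoding happens here
    AHttp.Post(AUrl, ARequest, Raw);
    // Rewind so the parser reads from the start of the downloaded data
    Raw.Position := 0;
    // With a nil owner the document is reference counted through the interface
    Result := TXMLDocument.Create(nil);
    // The parser works out the encoding from the bytes and the XML prolog
    Result.LoadFromStream(Raw);
  finally
    Raw.Free;
  end;
end;

end.
```

Note that with the default MSXML DOM vendor on Windows, COM has to be initialized on the calling thread (for example via CoInitialize) before the document is created.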