TIdHTTP character encoding of POST response

Take the following situation:

procedure Test;
var
  Response: String;
begin
  Response := IdHttp.Post(MyUrL, AStream);
  DoSomethingWith(Response);
end;

The web server returns data in UTF-8. Suppose it returns some UTF-8 XML containing the character é. The Response variable then does not contain this character but its UTF-8 byte sequence (#C3#A9), so it looks like Indy did no decoding?

Now I know how to solve this problem:

procedure Test;
var
  Response: String;
begin
  Response := UTF8ToString(IdHttp.Post(MyUrL, AStream));
  DoSomethingWith(Response);
end;

One caveat with this solution: Delphi raises warning W1058 (Implicit string cast with potential data loss from 'string' to 'RawByteString').

My question: is this the correct way to deal with this problem, or can I instruct TIdHTTP to do the conversion to UnicodeString for me?

whosrdaddy asked Sep 16 '13 15:09


1 Answer

If you are using an up-to-date version of Indy 10, then the overloaded version of TIdHTTP.Post() that returns a String does decode the data to Unicode. However, the charset actually used for the decoding depends on the media type that the HTTP Content-Type response header specifies:

  1. if the media type is either application/xml, application/xml-external-parsed-entity, application/xml-dtd, or is not a text/... type but does end with +xml, then the charset specified in the encoding attribute of the XML's prolog is used. If no charset is specified, UTF-8 is used.

  2. otherwise, if the Content-Type response header specifies a charset, then it is used.

  3. otherwise, if the media type is a text/... type, then:

    a. if the media type is text/xml, text/xml-external-parsed-entity, or ends with +xml, then us-ascii is used.

    b. otherwise ISO-8859-1 is used.

  4. otherwise, Indy's default encoding (ASCII by default) is used.
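The four rules above can be sketched as a small standalone function. This is only an illustration of the decision order as described here, not Indy's actual code: PickCharset and its three parameters are hypothetical names, the caller is assumed to have already split the Content-Type header into media type and charset, and rule 4 assumes Indy's default encoding has been left at ASCII. Unit scope names (System.SysUtils) assume Delphi XE2 or later.

```pascal
program CharsetRulesSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.StrUtils;

// Hypothetical helper, NOT part of Indy. It only mirrors the rules listed above.
//   AMediaType      : media type without parameters, e.g. 'text/xml'
//   AHeaderCharset  : charset parameter of Content-Type, '' if absent
//   APrologEncoding : encoding attribute of the XML prolog, '' if absent
function PickCharset(const AMediaType, AHeaderCharset,
  APrologEncoding: string): string;
var
  IsText, EndsWithXml: Boolean;
begin
  IsText := StartsText('text/', AMediaType);
  EndsWithXml := EndsText('+xml', AMediaType);

  // Rule 1: application XML types use the prolog's encoding, else UTF-8.
  if MatchText(AMediaType, ['application/xml',
       'application/xml-external-parsed-entity', 'application/xml-dtd'])
     or ((not IsText) and EndsWithXml) then
  begin
    if APrologEncoding <> '' then
      Result := APrologEncoding
    else
      Result := 'UTF-8';
  end
  // Rule 2: an explicit charset in the Content-Type header wins.
  else if AHeaderCharset <> '' then
    Result := AHeaderCharset
  // Rule 3: text/... types without a charset.
  else if IsText then
  begin
    if MatchText(AMediaType, ['text/xml', 'text/xml-external-parsed-entity'])
       or EndsWithXml then
      Result := 'us-ascii'      // rule 3a
    else
      Result := 'ISO-8859-1';   // rule 3b
  end
  // Rule 4: assumed default (ASCII unless the application changed it).
  else
    Result := 'us-ascii';
end;

begin
  Writeln(PickCharset('application/xml', '', ''));   // UTF-8
  Writeln(PickCharset('text/html', 'utf-8', ''));    // utf-8
  Writeln(PickCharset('text/xml', '', 'UTF-8'));     // us-ascii
  Writeln(PickCharset('text/plain', '', ''));        // ISO-8859-1
end.
```

Note the third call: for text/xml without a charset in the header, the prolog's encoding attribute is not consulted under these rules.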

Without seeing the actual HTTP Content-Type header, it is hard to know which condition your situation falls into. It sounds like it is falling into either #2 or #3b, which would account for the UTF-8 byte values being returned as-is if ISO-8859-1 or a similar charset is being used. ISO-8859-1 maps every byte to the code point with the same value, so the bytes #C3#A9 come through unchanged as the two characters Ã©.
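To see why an ISO-8859-1 decode leaves the UTF-8 bytes visible, here is a minimal console sketch using only the RTL's TEncoding class (no Indy involved). It assumes Delphi XE2 or later and that code page 28591 (ISO-8859-1) is available on the system, which is normally the case on Windows.

```pascal
program MojibakeSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  Raw: TBytes;
  Latin1: TEncoding;
  AsLatin1, AsUtf8: string;
begin
  // The UTF-8 encoding of 'é' is the two bytes C3 A9.
  Raw := TBytes.Create($C3, $A9);

  // Assumption: code page 28591 (ISO-8859-1) is installed.
  // Encodings obtained from GetEncoding must be freed by the caller.
  Latin1 := TEncoding.GetEncoding(28591);
  try
    AsLatin1 := Latin1.GetString(Raw);
  finally
    Latin1.Free;
  end;

  // TEncoding.UTF8 is a shared instance and must not be freed.
  AsUtf8 := TEncoding.UTF8.GetString(Raw);

  Writeln(Length(AsLatin1));                 // 2
  Writeln(IntToHex(Ord(AsLatin1[1]), 4));    // 00C3
  Writeln(IntToHex(Ord(AsLatin1[2]), 4));    // 00A9
  Writeln(Length(AsUtf8));                   // 1
  Writeln(IntToHex(Ord(AsUtf8[1]), 4));      // 00E9
end.
```

The wrong charset yields two characters whose code points equal the original byte values, which matches the #C3#A9 seen in the question.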

UTF8ToString() expects a UTF-8 encoded RawByteString as input, but you are passing it a UTF-16 encoded UnicodeString instead. The RTL will perform a UTF16->Ansi conversion in that situation, using a default Ansi charset for the conversion. That is why you get the compiler warning, because such a conversion can lose data.

XML is really a binary data format, subject to charset encodings. An XML parser needs to know what the XML's encoding is, and be able to parse the raw encoded bytes accordingly. That is why XML has an explicit encoding attribute right in the XML prolog. However, when TIdHTTP downloads XML as a String, although it does automatically decode it to Unicode, it does not yet update the XML's prolog accordingly.

The real solution is to not download XML as a String in the first place. Download it as a TStream instead (TMemoryStream is a better choice than TStringStream) so your XML parser has access to the original bytes, the original charset declaration, etc. You can pass the TStream to the TXMLDocument.LoadFromStream() method, for instance.
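A minimal sketch of that approach, written as a small unit. PostForXml and the unit name are hypothetical, the caller is assumed to supply a configured TIdHTTP instance and the request stream, and unit scope names assume Delphi XE2 or later. With the default MSXML vendor on Windows, COM is assumed to be initialized already (a VCL application does this; a console program would need to call CoInitialize itself).

```pascal
unit XmlPostSketch;

interface

uses
  System.Classes, IdHTTP, Xml.XMLIntf;

// Hypothetical helper: posts ARequest to AUrl and parses the raw response
// bytes as XML, so the parser sees the original charset declaration.
function PostForXml(AHttp: TIdHTTP; const AUrl: string;
  ARequest: TStream): IXMLDocument;

implementation

uses
  Xml.XMLDoc;

function PostForXml(AHttp: TIdHTTP; const AUrl: string;
  ARequest: TStream): IXMLDocument;
var
  Body: TMemoryStream;
begin
  Body := TMemoryStream.Create;
  try
    // Stream overload of Post: Indy writes the undecoded bytes into Body.
    AHttp.Post(AUrl, ARequest, Body);
    Body.Position := 0;

    // Created without an owner, so the interface reference manages lifetime.
    Result := TXMLDocument.Create(nil);
    Result.LoadFromStream(Body);
  finally
    Body.Free;
  end;
end;

end.
```

Because no String is involved, neither UTF8ToString() nor the W1058 warning comes into play; the XML parser decodes the bytes according to the prolog.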

Remy Lebeau answered Oct 02 '22 21:10