
How do I correctly deal with UTF-8 in web responses in my C# code?

To preface this, most of what I know about text encoding I learned from Joel Spolsky's article.

I am currently writing a C# web system to query our Google Search Appliance (GSA), read the results, and present them to the user in our own custom UI. However, the text summaries show encoding issues when I display them to users.

When I query the GSA directly in Chrome/IE/whatever, I get the following response:

Postgame Notes No. 8 seed DePaul vs. No. 9 seed USF Game 6 – Second Round

In my C# code, I am reading that response with the following code:

        var request = WebRequest.Create(LastQueryUrl);
        var response = (HttpWebResponse)request.GetResponse();

        if (response.StatusCode != HttpStatusCode.OK)
            return null;

        using (var reader = new StreamReader(response.GetResponseStream(), System.Text.Encoding.UTF8))
            content = reader.ReadToEnd();

When I debug the content variable, I see that string converted into:

USF Game 6 � Second

I am 99% sure that the data coming from the GSA is in UTF-8 format, because other parts of their XML say so, as do various tidbits in the documentation. However, if I read the stream using System.Text.Encoding.Unicode instead, none of the text is readable.

What am I doing wrong, and how can I get the text to display correctly?


Edit: using System.Text.Encoding.GetEncoding("ISO-8859-1") gives me

USF Game 6 Second

No question mark, though the dash doesn't show up.

KallDrexx asked Apr 02 '12 18:04




1 Answer

Could you try executing this code (instead of your using block) and pasting the result again? I’m assuming you’re on .NET 4.

using (var responseStream = response.GetResponseStream())
using (var memoryStream = new MemoryStream())
{
    responseStream.CopyTo(memoryStream);
    byte[] bytes = memoryStream.ToArray();
    content = BitConverter.ToString(bytes);
}

Edit: I notice that you haven’t been pasting the entire returned string in your posts. Is it because the rest of the string contains confidential data? If so, do not paste the result suggested above.

Edit2: To get your result to render correctly, you can use Encoding.GetEncoding(1252); however, I would suggest you don’t do that, for reasons I shall explain soon.

Explanation: From what I’ve figured, your issue seems to be that the sending party is getting their encodings wrong. You say that their documentation claims UTF-8, which is clearly contradicted by their XML declaration of ISO-8859-1. In reality, the encoding used is neither of the two.

In the hex string you uploaded, the culprit character has a byte value of 0x96, and occurs in the middle of the sequence 20-96-20. In both UTF-8 and ISO-8859-1 (as well as ASCII before them), 0x20 is a space character. However, in UTF-8, 0x96 is a continuation byte, and is not valid unless preceded by a start byte (which 0x20 is not). In ISO-8859-1, 0x96 is a C1 control character and, therefore, not a printable character (cannot be displayed to users).
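As a quick check of the reasoning above, here is a minimal sketch that decodes the three bytes 20-96-20 both ways. The class and member names (`ByteCheck`, `Sample`, `IsValidUtf8`, `Latin1MiddleChar`) are made up for illustration; only the `System.Text` calls are real.

```csharp
using System;
using System.Text;

public static class ByteCheck
{
    // The three bytes around the problem character: space, 0x96, space.
    public static readonly byte[] Sample = { 0x20, 0x96, 0x20 };

    // Returns true only if the bytes are well-formed UTF-8.
    public static bool IsValidUtf8(byte[] bytes)
    {
        // throwOnInvalidBytes: true makes the decoder throw instead of
        // silently substituting U+FFFD as Encoding.UTF8 does.
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            strictUtf8.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    // Decodes the sample as ISO-8859-1 and returns the middle character.
    public static char Latin1MiddleChar()
    {
        return Encoding.GetEncoding("ISO-8859-1").GetString(Sample)[1];
    }
}
```

With the sample bytes, `ByteCheck.IsValidUtf8(ByteCheck.Sample)` should come back false, and `char.IsControl(ByteCheck.Latin1MiddleChar())` should be true, which matches the question mark you saw with UTF-8 and the missing dash you saw with ISO-8859-1.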

Thus, we may infer that the original character encoding is neither UTF-8 nor ISO-8859-1, but Windows-1252, sometimes considered a “superset” of ISO-8859-1 since it replaces the 0x80-0x9F range of control characters with displayable characters. In fact, in Windows-1252, 0x96 is the en-dash character you were expecting.
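To see this for yourself, here is a small sketch that decodes the same three bytes as Windows-1252. The class name `Cp1252Check` is hypothetical. The `RegisterProvider` line is an assumption about your runtime: it is only needed on .NET Core / .NET 5 and later, where code page 1252 is not available by default; on the .NET Framework it can be dropped.

```csharp
using System;
using System.Text;

public static class Cp1252Check
{
    // Decodes the bytes 20-96-20 as Windows-1252.
    public static string DecodeSample()
    {
        // Needed on .NET Core / .NET 5+ only; on .NET Framework
        // code page 1252 works without this line.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        byte[] sample = { 0x20, 0x96, 0x20 };
        return Encoding.GetEncoding(1252).GetString(sample);
    }
}
```

The middle character of the returned string should be U+2013 (en dash), the character that shows up correctly in the browser.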

In consideration of the above, it might be safe to resolve your issue by assuming the Windows-1252 encoding; however, if I were you, I would contact the provider and inform them of this flaw.

using (var stream = response.GetResponseStream())
using (var reader = new StreamReader(stream, System.Text.Encoding.GetEncoding(1252)))
    content = reader.ReadToEnd();
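If you would rather not hard-code Windows-1252 in case the provider fixes their output later, one option (my own suggestion, not something the GSA documentation describes) is to try strict UTF-8 first and fall back to Windows-1252 only when the bytes are not valid UTF-8. A rough sketch, with `ResponseDecoder` and `Decode` as hypothetical names, assuming you have already read the response into a byte array:

```csharp
using System;
using System.Text;

public static class ResponseDecoder
{
    // Hypothetical helper: strict UTF-8 first, Windows-1252 as fallback.
    public static string Decode(byte[] bytes)
    {
        // Second argument (throwOnInvalidBytes) makes malformed UTF-8 throw.
        var strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            return strictUtf8.GetString(bytes);
        }
        catch (DecoderFallbackException)
        {
            // Needed on .NET Core / .NET 5+ only; not on .NET Framework.
            Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
            return Encoding.GetEncoding(1252).GetString(bytes);
        }
    }
}
```

This is a heuristic: some Windows-1252 byte sequences also happen to be valid UTF-8, so it can guess wrong on unusual input. It is only worth using if you expect the provider's encoding to change.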
Douglas answered Oct 04 '22 15:10