Parsing UTF8 JSON response from server

I am facing a weird problem parsing a JSON response from my server. It has been working fine for the last few months when getting the response (with Content-Type: text/html) this way:

string response = "";
using (var client = new System.Net.Http.HttpClient())
{
    var postData = new System.Net.Http.FormUrlEncodedContent(data);
    var clientResult = await client.PostAsync(url, postData);
    if(clientResult.IsSuccessStatusCode)
    {
        response = await clientResult.Content.ReadAsStringAsync();
    }
}
//Parse the response to a JObject...

But when the response arrives with Content-Type: text/html; charset=utf8, it throws an exception saying that the character set in Content-Type is invalid.

Exception message: The character set provided in ContentType is invalid. Cannot read content as string using an invalid character set.

So I changed this:

response = await clientResult.Content.ReadAsStringAsync();

to this:

var raw_response = await clientResult.Content.ReadAsByteArrayAsync();
response = Encoding.UTF8.GetString(raw_response, 0, raw_response.Length);

Now I can get the response without any exceptions, but parsing it throws a parsing exception. While debugging I got this (I changed the response to a shorter one for testing purposes):

var r1 = await clientResult.Content.ReadAsStringAsync();
var r2 = Encoding.UTF8.GetString(await clientResult.Content.ReadAsByteArrayAsync(), 0, raw_response.Length);
System.Diagnostics.Debug.WriteLine("Length: {0} - {1}", r1.Length, r1);
System.Diagnostics.Debug.WriteLine("Length: {0} - {1}", r2.Length, r2);

//Output
Length: 38 - {"version":1,"specialword":"C\u00e3o"}
Length: 39 - {"version":1,"specialword":"C\u00e3o"}

The JSON response format seems correct in both cases, but the lengths differ and I couldn't figure out why. When I copied the output into Notepad++ to spot hidden characters, a ? appeared out of nowhere.

Length: 38 - {"version":1,"specialword":"C\u00e3o"}
Length: 39 - ?{"version":1,"specialword":"C\u00e3o"}

This ? is obviously throwing the parsing exception but I don't know why Encoding.UTF8.GetString is causing that.

I have been battling with this for the last few hours and I really need some help.

letiagoalves asked Feb 15 '23 16:02


1 Answer

Well, I'm surprised that you're getting that behavior; I would have expected Encoding.UTF8.GetString to have handled that for you.

What you're seeing, the character value 0xFEFF, is a byte order mark ("BOM"). A BOM is unnecessary in UTF-8 because the byte order is not variable, but it is allowed, as a marker that the following text is encoded as UTF-8. (The actual byte sequence is EF BB BF, but when that's decoded as UTF-8, it becomes code point FEFF.)
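
To make that concrete, here is a minimal sketch (the byte array is made up for illustration, not taken from the real response) showing that Encoding.UTF8.GetString keeps the BOM as an ordinary leading character:

```csharp
using System;
using System.Text;

// Illustrative payload: the three UTF-8 BOM bytes followed by the ASCII bytes for "{}".
byte[] bytes = { 0xEF, 0xBB, 0xBF, 0x7B, 0x7D };

// GetString decodes the BOM like any other character instead of dropping it.
string decoded = Encoding.UTF8.GetString(bytes);

Console.WriteLine(decoded.Length);          // 3, not 2
Console.WriteLine(decoded[0] == '\uFEFF');  // True
```

That one extra character is the same off-by-one seen in the question (39 versus 38).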

If you create your own UTF8Encoding instance, you can tell it whether to include or exclude the BOM. (I think I'm mistaken about that; it may only control whether it includes one when encoding.)
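
A quick way to check that doubt is the sketch below. It assumes only the UTF8Encoding(bool) constructor and GetPreamble; the payload bytes are again made up. The flag changes the preamble the encoding reports for writing, while GetString still returns the BOM character:

```csharp
using System;
using System.Text;

var withBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true);
var withoutBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);

// The flag only decides which preamble the encoding reports to writers.
Console.WriteLine(withBom.GetPreamble().Length);     // 3
Console.WriteLine(withoutBom.GetPreamble().Length);  // 0

// Decoding is unaffected: the BOM still comes back as a leading U+FEFF.
byte[] payload = { 0xEF, 0xBB, 0xBF, 0x7B, 0x7D };
string text = withoutBom.GetString(payload);
Console.WriteLine(text.Length);                      // 3
```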

Alternatively, you could explicitly test for that and remove the BOM if present, e.g.:

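var raw_response = await clientResult.Content.ReadAsByteArrayAsync();
var r2 = Encoding.UTF8.GetString(raw_response, 0, raw_response.Length);
// Strip a leading BOM (U+FEFF); the length check avoids indexing an empty string.
if (r2.Length > 0 && r2[0] == '\uFEFF') {
    r2 = r2.Substring(1);
}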
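
Another option, sketched here with a hypothetical helper name (DecodeUtf8 is not part of any library), is to decode through a StreamReader, which skips a leading UTF-8 BOM by default:

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: decodes UTF-8 bytes, skipping a leading BOM if there is one.
static string DecodeUtf8(byte[] bytes)
{
    using (var stream = new MemoryStream(bytes))
    using (var reader = new StreamReader(stream, Encoding.UTF8))
    {
        // StreamReader consumes the encoding's preamble before returning text.
        return reader.ReadToEnd();
    }
}

// Illustrative payload: BOM bytes followed by the ASCII bytes for "{}".
byte[] payload = { 0xEF, 0xBB, 0xBF, 0x7B, 0x7D };
Console.WriteLine(DecodeUtf8(payload).Length);  // 2
```

In the question's code this would be called with the array returned by ReadAsByteArrayAsync, and it leaves input without a BOM unchanged.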
T.J. Crowder answered Feb 18 '23 14:02