To preface this: most of what I know about text encoding I learned from Joel Spolsky's article.
I am currently writing a C# web system to query our Google Search Appliance (GSA), read the results and present them to the user in our own custom UI. However, there are encoding issues when I display the text summaries to the users.
When I query the GSA directly in Chrome/IE/whatever, I get the following response:
Postgame Notes No. 8 seed DePaul vs. No. 9 seed USF Game 6 – Second Round
In my C# code, I am reading that response with the following code:
var request = WebRequest.Create(LastQueryUrl);
var response = (HttpWebResponse)request.GetResponse();
if (response.StatusCode != HttpStatusCode.OK)
    return null;
using (var reader = new StreamReader(response.GetResponseStream(), System.Text.Encoding.UTF8))
    content = reader.ReadToEnd();
When I debug the content variable, I see that string converted into:
USF Game 6 � Second
I am 99% sure that the data coming from the GSA is UTF-8: other parts of their XML say so, as do various tidbits in the documentation. Even so, if I read the stream using System.Text.Encoding.Unicode instead, none of the text is readable.
What am I doing wrong, and how can I get the text to display correctly?
Reading with System.Text.Encoding.GetEncoding("ISO-8859-1") gives me:
USF Game 6 Second
No question mark, though the dash doesn't show up.
Could you try executing this code (instead of your using block) and pasting the result again? I’m assuming you’re on .NET 4.
using (var responseStream = response.GetResponseStream())
using (var memoryStream = new MemoryStream())
{
    responseStream.CopyTo(memoryStream);
    byte[] bytes = memoryStream.ToArray();
    content = BitConverter.ToString(bytes);
}
Edit: I notice that you haven’t been pasting the entire returned string in your posts. Is it because the rest of the string contains confidential data? If so, do not paste the result suggested above.
Edit2: To get your result to render correctly, you can use Encoding.GetEncoding(1252); however, I would suggest you don’t do that, for reasons I shall explain soon.
Explanation: From what I’ve figured, your issue seems to be that the sending party is getting their encodings wrong. You say that their documentation claims UTF-8, which is clearly contradicted by their XML declaration of ISO-8859-1. In reality, the encoding used is neither of the two.
In the hex string you uploaded, the culprit character has a byte value of 0x96, and occurs in the middle of the sequence 20-96-20. In both UTF-8 and ISO-8859-1 (as well as ASCII before them), 0x20 is a space character. However, in UTF-8, 0x96 is a continuation byte, and is not valid unless preceded by a start byte (which 0x20 is not). In ISO-8859-1, 0x96 is a C1 control character and, therefore, not a printable character (cannot be displayed to users).
Thus, we may infer that the original character encoding is neither UTF-8 nor ISO-8859-1, but Windows-1252, sometimes considered a “superset” of ISO-8859-1 since it replaces the 0x80–0x9F range of control characters with displayable characters. In fact, in Windows-1252, 0x96 is the en-dash character you were expecting.
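The reasoning above can be checked directly by decoding the three bytes 20-96-20 under each candidate encoding and looking at the code point that comes out for 0x96. This is only a minimal sketch: the class and method names (ByteInspector, MiddleCodePoint, Windows1252) are made up for illustration. The RegisterProvider call is an assumption about the runtime: it is needed for code page 1252 on .NET Core / .NET 5+, and can be dropped on the .NET Framework.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not part of the original posts.
static class ByteInspector
{
    // The byte sequence discussed above: space, 0x96, space.
    static readonly byte[] Sample = { 0x20, 0x96, 0x20 };

    // Decodes the sample with the given encoding and reports the
    // code point that the middle byte (0x96) was turned into.
    public static string MiddleCodePoint(Encoding encoding)
    {
        string decoded = encoding.GetString(Sample);
        return string.Format("U+{0:X4}", (int)decoded[1]);
    }

    // Returns the Windows-1252 encoding.
    public static Encoding Windows1252()
    {
        // Assumption: required on .NET Core / .NET 5+ only; remove this
        // line if you are compiling against the .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252);
    }
}
```

Calling ByteInspector.MiddleCodePoint(Encoding.UTF8) should give U+FFFD (the replacement character, shown as the question mark glyph), Encoding.GetEncoding("ISO-8859-1") should give U+0096 (an invisible C1 control character), and ByteInspector.Windows1252() should give U+2013 (the en dash). That matches the three results reported in the question.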
In consideration of the above, it might be safe to resolve your issue by assuming the Windows-1252 encoding; however, if I were you, I would contact the provider and inform them of this flaw.
using (var stream = response.GetResponseStream())
using (var reader = new StreamReader(stream, System.Text.Encoding.GetEncoding(1252)))
    content = reader.ReadToEnd();
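If the provider later fixes their output to be real UTF-8, hard-coding 1252 as above would start mangling it. One possible hedge, sketched below under that assumption, is to try a strict UTF-8 decode first and fall back to Windows-1252 only when the bytes are not valid UTF-8. LenientDecoder and its Decode method are hypothetical names, not something from the GSA or the posts above, and the RegisterProvider line is again only needed on .NET Core / .NET 5+.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not part of the original posts.
static class LenientDecoder
{
    public static string Decode(byte[] bytes)
    {
        // throwOnInvalidBytes: true makes malformed UTF-8 raise an exception
        // instead of silently producing U+FFFD replacement characters.
        var strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            return strictUtf8.GetString(bytes);
        }
        catch (DecoderFallbackException)
        {
            // Assumption: required on .NET Core / .NET 5+ only; remove this
            // line if you are compiling against the .NET Framework.
            Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
            return Encoding.GetEncoding(1252).GetString(bytes);
        }
    }
}
```

To use it, read the response into a byte array first (for example with the MemoryStream approach shown earlier) and pass that array to LenientDecoder.Decode. Short Windows-1252 text can occasionally happen to be valid UTF-8, so this is a heuristic rather than a guarantee.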