
Encoding trouble with HttpWebResponse

Tags:

c#

encoding

Here is a snippet of the code:

    HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(request.RawUrl);
    WebRequest.DefaultWebProxy = null; //Ensure that we will not loop by going again in the proxy
    HttpWebResponse response = (HttpWebResponse)webRequest.GetResponse();
    string charSet = response.CharacterSet;
    Encoding encoding;
    if (String.IsNullOrEmpty(charSet))
        encoding = Encoding.Default;
    else
        encoding = Encoding.GetEncoding(charSet);

    StreamReader resStream = new StreamReader(response.GetResponseStream(), encoding);
    return resStream.ReadToEnd();

The problem appears when I test with http://www.google.fr.

None of the "é" characters display correctly. I have tried changing the encoding from ASCII to UTF8 and the output is still wrong. I tested the HTML file in a browser, and the browser displays the text correctly, so I am fairly sure the problem is in the method I use to download the HTML file.

What should I change?


Update 1: Code and test file changed

Patrick Desjardins asked Oct 22 '08 21:10


2 Answers

CharacterSet defaults to "ISO-8859-1" when the server's Content-Type header does not specify a charset (this header is separate from the "charset" meta tag in the HTML). I compare HttpWebResponse.CharacterSet with the charset declared in the HTML. If they differ, I read the page again using the charset from the HTML, so that it is decoded with the correct encoding this time.

See the code:

    string strWebPage = "";
    // create request
    System.Net.WebRequest objRequest = System.Net.HttpWebRequest.Create(sURL);
    // get response
    System.Net.HttpWebResponse objResponse;
    objResponse = (System.Net.HttpWebResponse)objRequest.GetResponse();
    // get charset and encoding from the server's header
    string Charset = objResponse.CharacterSet;
    Encoding encoding = Encoding.GetEncoding(Charset);
    // read response; the using block disposes the StreamReader
    using (StreamReader sr =
            new StreamReader(objResponse.GetResponseStream(), encoding))
    {
        strWebPage = sr.ReadToEnd();
    }

    // Check real charset meta-tag in HTML
    int CharsetStart = strWebPage.IndexOf("charset=", StringComparison.OrdinalIgnoreCase);
    if (CharsetStart > 0)
    {
        CharsetStart += 8;
        // skip an opening quote, as in <meta charset="utf-8">
        if (CharsetStart < strWebPage.Length &&
            (strWebPage[CharsetStart] == '"' || strWebPage[CharsetStart] == '\''))
            CharsetStart++;
        int CharsetEnd = strWebPage.IndexOfAny(new[] { ' ', '\"', '\'', ';', '>' }, CharsetStart);
        // IndexOfAny returns -1 when no terminator follows; keep the header charset then
        string RealCharset = CharsetEnd > CharsetStart
                ? strWebPage.Substring(CharsetStart, CharsetEnd - CharsetStart)
                : Charset;

        // real charset meta-tag in HTML differs from supplied server header???
        // (charset names are case-insensitive, so compare them that way)
        if (!string.Equals(RealCharset, Charset, StringComparison.OrdinalIgnoreCase))
        {
            // get correct encoding
            Encoding CorrectEncoding = Encoding.GetEncoding(RealCharset);

            // read the web page again, but with correct encoding this time
            //   create request
            System.Net.WebRequest objRequest2 = System.Net.HttpWebRequest.Create(sURL);
            //   get response
            System.Net.HttpWebResponse objResponse2;
            objResponse2 = (System.Net.HttpWebResponse)objRequest2.GetResponse();
            //   read response
            using (StreamReader sr =
               new StreamReader(objResponse2.GetResponseStream(), CorrectEncoding))
            {
                strWebPage = sr.ReadToEnd();
            }
        }
    }
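A possible variation on the approach above, sketched here rather than taken from the answer: if the response body is kept as raw bytes, it can be decoded a second time without issuing a second request. The class name `HtmlCharset`, its method names, and the regular expression are all hypothetical, and the pattern only covers the two common meta-tag forms.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class HtmlCharset
{
    // Matches charset=... in either <meta charset="utf-8"> or
    // <meta http-equiv="Content-Type" content="text/html; charset=utf-8">.
    private static readonly Regex MetaCharset = new Regex(
        @"<meta[^>]*?charset\s*=\s*[""']?\s*([A-Za-z0-9_\-:.]+)",
        RegexOptions.IgnoreCase);

    // Returns the charset name declared in the markup, or null if none is found.
    public static string FindMetaCharset(string html)
    {
        Match m = MetaCharset.Match(html);
        return m.Success ? m.Groups[1].Value : null;
    }

    // Decodes already-downloaded bytes with the header encoding first,
    // then once more with the meta-tag encoding if the two disagree.
    public static string Decode(byte[] body, Encoding headerEncoding)
    {
        string html = headerEncoding.GetString(body);
        string declared = FindMetaCharset(html);
        if (declared == null) return html;

        Encoding metaEncoding;
        try { metaEncoding = Encoding.GetEncoding(declared); }
        catch (ArgumentException) { return html; } // unknown name: keep the first decode

        return metaEncoding.CodePage == headerEncoding.CodePage
            ? html
            : metaEncoding.GetString(body);
    }
}
```

The bytes could come from copying `response.GetResponseStream()` into a `MemoryStream` and calling `ToArray()`. The first decode works for locating the meta tag because the tag itself is plain ASCII, which ISO-8859-1 and UTF-8 both decode identically.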
Alex Dubinsky answered Oct 02 '22 15:10


Firstly, the easier way of writing that code is to use a StreamReader and ReadToEnd:

    HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(myURL);
    using (HttpWebResponse response = (HttpWebResponse)webRequest.GetResponse())
    {
        using (Stream resStream = response.GetResponseStream())
        {
            StreamReader reader = new StreamReader(resStream, Encoding.???);
            return reader.ReadToEnd();
        }
    }

Then it's "just" a matter of finding the right encoding. How did you create the file? If it's with Notepad then you probably want Encoding.Default - but that's obviously not portable, as it's the default encoding for your PC.

In a well-run web server, the response will indicate the encoding in its headers. Having said that, response headers sometimes claim one thing while the HTML claims another.
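As a rough sketch of picking the encoding from the header value, a small helper can map a charset name such as `HttpWebResponse.CharacterSet` to an `Encoding` and fall back when the name is missing or unknown. `CharsetResolver` and `Resolve` are hypothetical names, and the choice of fallback is an assumption left to the caller.

```csharp
using System;
using System.Text;

public static class CharsetResolver
{
    // Maps a charset name (for example HttpWebResponse.CharacterSet) to an Encoding.
    // Returns the fallback when the name is missing or not recognised by this runtime.
    public static Encoding Resolve(string charSet, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(charSet)) return fallback;
        try
        {
            // Some servers quote the value, e.g. charset="utf-8".
            return Encoding.GetEncoding(charSet.Trim().Trim('"', '\''));
        }
        catch (ArgumentException)
        {
            // Encoding.GetEncoding throws ArgumentException for unsupported names.
            return fallback;
        }
    }
}
```

The result could be passed where the snippet above has its encoding placeholder, for example `new StreamReader(resStream, CharsetResolver.Resolve(response.CharacterSet, Encoding.UTF8))`. Using UTF-8 as the fallback there is only one reasonable guess, not something the headers guarantee.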

Jon Skeet answered Oct 02 '22 15:10