Here is a snippet of the code :
HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(request.RawUrl); WebRequest.DefaultWebProxy = null;//Ensure that we will not loop by going again in the proxy HttpWebResponse response = (HttpWebResponse)webRequest.GetResponse(); string charSet = response.CharacterSet; Encoding encoding; if (String.IsNullOrEmpty(charSet)) encoding = Encoding.Default; else encoding = Encoding.GetEncoding(charSet); StreamReader resStream = new StreamReader(response.GetResponseStream(), encoding); return resStream.ReadToEnd();
The problem is if I test with : http://www.google.fr
All "é" are not displaying well. I have try to change ASCII to UTF8 and it still display wrong. I have tested the html file in a browser and the browser display the html text well so I am pretty sure the problem is in the method I use to download the html file.
What should I change?
removed dead ImageShack link
This class contains support for HTTP-specific uses of the properties and methods of the WebResponse class. The HttpWebResponse class is used to build HTTP stand-alone client applications that send HTTP requests and receive HTTP responses.
C# Syntax: [Serializable] public class HttpWebResponse : WebResponse. Remarks. The HttpWebResponse class contains support for the properties and methods included in WebResponse with additional elements that enable the user to interact directly with the HTTP protocol.
HttpWebRequest exposes common HTTP header values sent to the Internet resource as properties, set by methods, or set by the system; the following table contains a complete list. You can set other headers in the Headers property as name/value pairs.
from a quick check in the object browser HttpWebRequest isn't seem to be disposable. it needs to have IDisposable interface.
CharacterSet is "ISO-8859-1" by default, if it is not specified in server's content type header (different from "charset" meta tag in HTML). I compare HttpWebResponse.CharacterSet with charset attribute of HTML. If they are different - I use the charset as specified in HTML to re-read the page again, but with correct encoding this time.
See the code:
string strWebPage = ""; // create request System.Net.WebRequest objRequest = System.Net.HttpWebRequest.Create(sURL); // get response System.Net.HttpWebResponse objResponse; objResponse = (System.Net.HttpWebResponse)objRequest.GetResponse(); // get correct charset and encoding from the server's header string Charset = objResponse.CharacterSet; Encoding encoding = Encoding.GetEncoding(Charset); // read response using (StreamReader sr = new StreamReader(objResponse.GetResponseStream(), encoding)) { strWebPage = sr.ReadToEnd(); // Close and clean up the StreamReader sr.Close(); } // Check real charset meta-tag in HTML int CharsetStart = strWebPage.IndexOf("charset="); if (CharsetStart > 0) { CharsetStart += 8; int CharsetEnd = strWebPage.IndexOfAny(new[] { ' ', '\"', ';' }, CharsetStart); string RealCharset = strWebPage.Substring(CharsetStart, CharsetEnd - CharsetStart); // real charset meta-tag in HTML differs from supplied server header??? if(RealCharset!=Charset) { // get correct encoding Encoding CorrectEncoding = Encoding.GetEncoding(RealCharset); // read the web page again, but with correct encoding this time // create request System.Net.WebRequest objRequest2 = System.Net.HttpWebRequest.Create(sURL); // get response System.Net.HttpWebResponse objResponse2; objResponse2 = (System.Net.HttpWebResponse)objRequest2.GetResponse(); // read response using (StreamReader sr = new StreamReader(objResponse2.GetResponseStream(), CorrectEncoding)) { strWebPage = sr.ReadToEnd(); // Close and clean up the StreamReader sr.Close(); } } }
Firstly, the easier way of writing that code is to use a StreamReader and ReadToEnd:
HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(myURL); using (HttpWebResponse response = (HttpWebResponse)webRequest.GetResponse()) { using (Stream resStream = response.GetResponseStream()) { StreamReader reader = new StreamReader(resStream, Encoding.???); return reader.ReadToEnd(); } }
Then it's "just" a matter of finding the right encoding. How did you create the file? If it's with Notepad then you probably want Encoding.Default
- but that's obviously not portable, as it's the default encoding for your PC.
In a well-run web server, the response will indicate the encoding in its headers. Having said that, response headers sometimes claim one thing and the HTML claims another, in some cases.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With