I have an issue with some content that we are downloading from the web for a screen scraping tool that I am building.
In the code below, the string returned by the WebClient.DownloadString method contains some odd characters in the downloaded source for a few (not all) web sites.
I recently added the HTTP headers shown below. Previously the same code was called without the headers, to the same effect. I have not tried variations on the 'Accept-Charset' header; I don't know much about text encoding other than the basics.
The characters, or character sequences, that I refer to are:
"ï»¿"
and
"Â"
These characters are not seen when you use "view source" in a web browser. What could be causing this and how can I rectify the problem?
    string urlData = String.Empty;
    WebClient wc = new WebClient();

    // Add headers to impersonate a web browser. Some web sites
    // will not respond correctly without these headers
    wc.Headers.Add("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12");
    wc.Headers.Add("Accept", "*/*");
    wc.Headers.Add("Accept-Language", "en-gb,en;q=0.5");
    wc.Headers.Add("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.7");

    urlData = wc.DownloadString(uri);
This method retrieves the specified resource. After it downloads the resource, the method uses the encoding specified in the Encoding property to convert the resource to a String. This method blocks while downloading the resource.
In a nutshell, WebRequest (in its HTTP-specific implementation, HttpWebRequest) represents the original way to make HTTP requests in the .NET Framework. WebClient provides a simple but limited wrapper around HttpWebRequest. And HttpClient is the new and improved way of doing HTTP requests and posts, having arrived with .

ï»¿ is the windows-1252 representation of the octets EF BB BF. That's the UTF-8 byte-order mark, which implies that your remote web page is encoded in UTF-8 but you're reading it as if it were windows-1252. According to the docs, WebClient.DownloadString uses WebClient.Encoding as its encoding when it converts the remote resource into a string. Set it to System.Text.Encoding.UTF8 and things should theoretically work.
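To see the effect in isolation, here is a minimal sketch that decodes the same bytes two ways. The class and method names (`BomDemo`, `DecodeAsLatin1`, `DecodeAsUtf8`) are made up for illustration. ISO-8859-1 stands in for windows-1252 here, on the assumption that you may be running on a newer .NET where windows-1252 needs an extra code-page provider; the two encodings agree on every byte from 0xA0 to 0xFF, which covers all the bytes used below.

```csharp
using System;
using System.Text;

public static class BomDemo
{
    // The three octets of the UTF-8 byte-order mark.
    public static readonly byte[] Utf8Bom = { 0xEF, 0xBB, 0xBF };

    // Decodes bytes with a single-byte Western encoding (the "wrong" guess).
    public static string DecodeAsLatin1(byte[] bytes)
    {
        return Encoding.GetEncoding("iso-8859-1").GetString(bytes);
    }

    // Decodes bytes as UTF-8 (the "right" guess for these pages).
    // Note: Encoding.GetString does not strip a leading BOM by itself.
    public static string DecodeAsUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }
}
```

Decoding `BomDemo.Utf8Bom` with `DecodeAsLatin1` gives the three characters "ï»¿". The "Â" from the question is very likely the same kind of problem: a non-breaking space is the two bytes C2 A0 in UTF-8, and C2 read as a single-byte character is "Â". In the question's code, the suggested fix amounts to setting `wc.Encoding = Encoding.UTF8;` before calling `DownloadString`. If a U+FEFF character still shows up at the start of the string, `TrimStart('\uFEFF')` removes it.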
The way WebClient.DownloadString is implemented is very dumb. It should get the character encoding from the Content-Type header in the response, but instead it expects the developer to specify the expected encoding beforehand. I don't know what the developers of this class were thinking.
I have created an auxiliary class that retrieves the encoding name from the Content-Type header of the response:
    using System;
    using System.Collections.Specialized;
    using System.Linq;
    using System.Text;

    public static class WebUtils
    {
        public static Encoding GetEncodingFrom(
            NameValueCollection responseHeaders,
            Encoding defaultEncoding = null)
        {
            if (responseHeaders == null)
                throw new ArgumentNullException("responseHeaders");

            // Note that key lookup is case-insensitive
            var contentType = responseHeaders["Content-Type"];
            if (contentType == null)
                return defaultEncoding;

            // Everything after the first ';' is a parameter, e.g. "charset=utf-8"
            var contentTypeParts = contentType.Split(';');
            if (contentTypeParts.Length <= 1)
                return defaultEncoding;

            var charsetPart = contentTypeParts.Skip(1).FirstOrDefault(
                p => p.TrimStart().StartsWith("charset", StringComparison.InvariantCultureIgnoreCase));
            if (charsetPart == null)
                return defaultEncoding;

            var charsetPartParts = charsetPart.Split('=');
            if (charsetPartParts.Length != 2)
                return defaultEncoding;

            // Some servers quote the value, e.g. charset="utf-8"
            var charsetName = charsetPartParts[1].Trim().Trim('"');
            if (charsetName == "")
                return defaultEncoding;

            try
            {
                return Encoding.GetEncoding(charsetName);
            }
            catch (ArgumentException ex)
            {
                throw new UnknownEncodingException(
                    charsetName,
                    "The server returned data in an unknown encoding: " + charsetName,
                    ex);
            }
        }
    }
(UnknownEncodingException is a custom exception class; feel free to replace it with InvalidOperationException or whatever else you want.)
Then the following extension method for the WebClient class will do the trick:
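    using System;
    using System.Net;
    using System.Text;

    public static class WebClientExtensions
    {
        public static string DownloadStringAwareOfEncoding(this WebClient webClient, Uri uri)
        {
            // Download the raw bytes first so the response headers are available
            var rawData = webClient.DownloadData(uri);

            // Fall back to UTF-8 when the server does not declare a charset
            var encoding = WebUtils.GetEncodingFrom(webClient.ResponseHeaders, Encoding.UTF8);
            return encoding.GetString(rawData);
        }
    }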
So in your example you would do:
urlData = wc.DownloadStringAwareOfEncoding(uri);
...and that's it.
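If you would rather not split the header by hand, the same idea can be sketched with the framework's own `System.Net.Mime.ContentType` class, which parses a Content-Type value and exposes its `CharSet`. This is only an illustrative alternative, not part of the helper above: the names `ContentTypeSketch`, `CharsetOf` and `Decode` are invented for this example, and falling back to UTF-8 for a missing or unknown charset is an assumption you may want to change.

```csharp
using System;
using System.Net.Mime;
using System.Text;

public static class ContentTypeSketch
{
    // Returns the charset parameter of a Content-Type value,
    // or null when the header is missing, malformed, or has no charset.
    public static string CharsetOf(string contentTypeHeader)
    {
        if (string.IsNullOrWhiteSpace(contentTypeHeader))
            return null;
        try
        {
            return new ContentType(contentTypeHeader).CharSet;
        }
        catch (FormatException)
        {
            return null;
        }
    }

    // Decodes a response body using the declared charset.
    // Assumption: UTF-8 is used when no usable charset is declared.
    public static string Decode(byte[] body, string contentTypeHeader)
    {
        Encoding encoding = Encoding.UTF8;
        string charset = CharsetOf(contentTypeHeader);
        if (charset != null)
        {
            try
            {
                encoding = Encoding.GetEncoding(charset);
            }
            catch (ArgumentException)
            {
                // Unknown charset name: keep the UTF-8 fallback.
            }
        }
        return encoding.GetString(body);
    }
}
```

With a `WebClient` you would pass the bytes from `DownloadData` and the value of `ResponseHeaders["Content-Type"]` to `Decode`. Unlike the helper above, this version silently falls back instead of throwing on an unknown charset, which may or may not be what you want for a scraper.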