WebClient.DownloadString() returns string with peculiar characters

I have an issue with some content that we are downloading from the web for a screen scraping tool that I am building.

In the code below, the string returned by the WebClient.DownloadString method contains some odd characters in the downloaded source for a few (but not all) web sites.

I have recently added the HTTP headers shown below. Previously, the same code was called without the headers, with the same result. I have not tried variations on the 'Accept-Charset' header; I don't know much about text encoding beyond the basics.

The characters, or character sequences, that I refer to are:

"ï»¿"

and

"Â"

These characters are not seen when you use "view source" in a web browser. What could be causing this and how can I rectify the problem?

string urlData = String.Empty;
WebClient wc = new WebClient();

// Add headers to impersonate a web browser. Some web sites
// will not respond correctly without these headers
wc.Headers.Add("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12");
wc.Headers.Add("Accept", "*/*");
wc.Headers.Add("Accept-Language", "en-gb,en;q=0.5");
wc.Headers.Add("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.7");

urlData = wc.DownloadString(uri);
gb2d asked Jan 17 '11 18:01


2 Answers

ï»¿ is the windows-1252 representation of the octets EF BB BF. That's the UTF-8 byte order mark (BOM), which implies that the remote web page is encoded in UTF-8 but you're reading it as if it were windows-1252. According to the docs, WebClient.DownloadString uses WebClient.Encoding as its encoding when it converts the remote resource into a string. Set it to System.Text.Encoding.UTF8 and things should theoretically work.
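A minimal sketch of both points, under a few assumptions: the class and method names are hypothetical, ISO-8859-1 stands in for windows-1252 (the two map the bytes used here to the same characters, and ISO-8859-1 is available without registering extra code pages), and the "Â" is assumed to come from a UTF-8 no-break space (C2 A0), which is a common cause but not confirmed by the question.

```csharp
using System;
using System.Net;
using System.Text;

public static class EncodingSketch
{
    // Decodes bytes with a single-byte Latin encoding, reproducing the
    // misreading described above. ISO-8859-1 is used in place of
    // windows-1252 (an assumption of this sketch).
    public static string DecodeAsLatin1(byte[] bytes)
    {
        return Encoding.GetEncoding("iso-8859-1").GetString(bytes);
    }

    // Decodes the same bytes as UTF-8, which is what the page actually uses.
    public static string DecodeAsUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    // The suggested fix: tell WebClient to decode the response as UTF-8.
    public static string DownloadAsUtf8(Uri uri)
    {
        using (var wc = new WebClient())
        {
            wc.Encoding = Encoding.UTF8;
            return wc.DownloadString(uri);
        }
    }
}
```

Passing the bytes EF BB BF C2 A0 to DecodeAsLatin1 yields "ï»¿Â" followed by a no-break space, while DecodeAsUtf8 yields just U+FEFF and U+00A0, neither of which is visible in most viewers.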

dkarp answered Sep 29 '22 19:09


The way WebClient.DownloadString is implemented is very dumb. It should get the character encoding from the Content-Type header in the response, but instead it expects the developer to specify the expected encoding beforehand. I don't know what the developers of this class were thinking.

I have created an auxiliary class that retrieves the encoding name from the Content-Type header of the response:

using System;
using System.Collections.Specialized;
using System.Linq;
using System.Text;

public static class WebUtils
{
    public static Encoding GetEncodingFrom(
        NameValueCollection responseHeaders,
        Encoding defaultEncoding = null)
    {
        if(responseHeaders == null)
            throw new ArgumentNullException("responseHeaders");

        //Note that key lookup is case-insensitive
        var contentType = responseHeaders["Content-Type"];
        if(contentType == null)
            return defaultEncoding;

        //The media type comes first; parameters such as charset follow after ';'
        var contentTypeParts = contentType.Split(';');
        if(contentTypeParts.Length <= 1)
            return defaultEncoding;

        var charsetPart =
            contentTypeParts.Skip(1).FirstOrDefault(
                p => p.TrimStart().StartsWith("charset", StringComparison.InvariantCultureIgnoreCase));
        if(charsetPart == null)
            return defaultEncoding;

        var charsetPartParts = charsetPart.Split('=');
        if(charsetPartParts.Length != 2)
            return defaultEncoding;

        //Some servers quote the value (charset="utf-8"), so strip the quotes too
        var charsetName = charsetPartParts[1].Trim().Trim('"');
        if(charsetName == "")
            return defaultEncoding;

        try
        {
            return Encoding.GetEncoding(charsetName);
        }
        catch(ArgumentException ex)
        {
            throw new UnknownEncodingException(
                charsetName,
                "The server returned data in an unknown encoding: " + charsetName,
                ex);
        }
    }
}

(UnknownEncodingException is a custom exception class; feel free to replace it with InvalidOperationException or whatever else you prefer.)

Then the following extension method for the WebClient class will do the trick:

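// Requires: using System; using System.Net; using System.Text;
public static class WebClientExtensions
{
    public static string DownloadStringAwareOfEncoding(this WebClient webClient, Uri uri)
    {
        // Download the raw bytes first so that the response headers are available
        var rawData = webClient.DownloadData(uri);
        // Fall back to UTF-8 when the response does not declare a charset
        var encoding = WebUtils.GetEncodingFrom(webClient.ResponseHeaders, Encoding.UTF8);
        return encoding.GetString(rawData);
    }
}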

So in your example you would do:

urlData = wc.DownloadStringAwareOfEncoding(uri); 

...and that's it.
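As a smaller sketch of the same idea, the charset can probably also be read with the framework's System.Net.Mime.ContentType class instead of splitting the header by hand. The ContentTypeCharset class and FromHeader method are hypothetical names. Unlike the helper in this answer, which throws on an unknown encoding, this version falls back silently on a missing, malformed, or unknown charset.

```csharp
using System;
using System.Net.Mime;
using System.Text;

public static class ContentTypeCharset
{
    // Returns the encoding named by the charset parameter of a Content-Type
    // header value, or the supplied fallback when none can be determined.
    public static Encoding FromHeader(string contentTypeHeader, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(contentTypeHeader))
            return fallback;

        try
        {
            // ContentType parses "type/subtype; name=value" header values
            var contentType = new ContentType(contentTypeHeader);
            if (string.IsNullOrEmpty(contentType.CharSet))
                return fallback;

            return Encoding.GetEncoding(contentType.CharSet);
        }
        catch (FormatException)
        {
            // The header value could not be parsed
            return fallback;
        }
        catch (ArgumentException)
        {
            // The charset name is not a known encoding
            return fallback;
        }
    }
}
```

It would be used in the same way as the helper in this answer: download the bytes with DownloadData, pass webClient.ResponseHeaders["Content-Type"] and a default such as Encoding.UTF8 to FromHeader, and decode the bytes with the result.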

Konamiman answered Sep 29 '22 17:09