C# Downloading website into string using C# WebClient or HttpWebRequest

Tags:

c#

httpwebrequest

webclient

I am trying to download the contents of a website. However, for one particular page, the returned string contains jumbled data with many � characters.

Here is the code I was originally using:

HttpWebRequest req = (HttpWebRequest)HttpWebRequest.Create(url);
req.Method = "GET";
req.UserAgent = "Mozilla/5.0 (Windows; U; MSIE 9.0; WIndows NT 9.0; en-US))";
string source;
using (StreamReader reader = new StreamReader(req.GetResponse().GetResponseStream()))
{
    source = reader.ReadToEnd();
}
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(source);

I also tried an alternate implementation with WebClient, but got the same result:

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
using (WebClient client = new WebClient())
using (var read = client.OpenRead(url))
{
    doc.Load(read, true);
}

From searching, I guess this might be an encoding issue, so I tried both of the solutions linked below, but I still cannot get this to work.

http://blogs.msdn.com/b/feroze_daud/archive/2004/03/30/104440.aspx
http://bytes.com/topic/c-sharp/answers/653250-webclient-encoding

The offending page that I cannot seem to download is the United_States article on the English version of Wikipedia (en . wikipedia . org / wiki / United_States). I have tried a number of other Wikipedia articles and have not seen this issue with them.

452

asked Sep 22 '11 16:09

Nick Collier

2 Answers

Using the built-in loader in HtmlAgilityPack worked for me:

HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://en.wikipedia.org/wiki/United_States");
string html = doc.DocumentNode.OuterHtml; // no jumbled data here

Edit:

Using a standard WebClient with your user-agent results in an HTTP 403 (Forbidden). Using this user-agent instead worked for me:

using (WebClient wc = new WebClient())
{
    wc.Headers.Add("user-agent", "Mozilla/5.0 (Windows; Windows NT 5.1; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4");
    string html = wc.DownloadString("http://en.wikipedia.org/wiki/United_States");
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
}

Also see this SO thread: WebClient forbids opening wikipedia page?

200

answered Oct 24 '22 04:10

BrokenGlass

The response is gzip encoded. Try the following to decode the stream:

UPDATE

Based on the comment by BrokenGlass, setting the following properties should solve your problem (it worked for me):

req.Headers[HttpRequestHeader.AcceptEncoding] = "gzip, deflate";
req.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
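
For context, here is a minimal sketch of how those two settings might fit into a complete request, plus the same idea applied to WebClient by overriding GetWebRequest. The names DecompressingHttp, CreateRequest and DecompressingWebClient are made up for illustration and are not part of any library. To my understanding, setting AutomaticDecompression also makes the framework advertise gzip/deflate on its own, so the explicit Accept-Encoding header is left out here. On recent .NET versions WebRequest.Create and WebClient compile with obsolescence warnings.

```csharp
using System;
using System.Net;

public static class DecompressingHttp
{
    // Hypothetical helper: builds a GET request whose response body is
    // decompressed by the framework before the caller reads it.
    public static HttpWebRequest CreateRequest(string url, string userAgent)
    {
        var req = (HttpWebRequest)WebRequest.Create(url);
        req.Method = "GET";
        req.UserAgent = userAgent;
        // Ask the framework to inflate gzip or deflate bodies transparently.
        req.AutomaticDecompression =
            DecompressionMethods.GZip | DecompressionMethods.Deflate;
        return req;
    }
}

// Hypothetical WebClient subclass: applies the same setting to every
// HTTP request the client creates.
public class DecompressingWebClient : WebClient
{
    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        var http = request as HttpWebRequest;
        if (http != null)
        {
            http.AutomaticDecompression =
                DecompressionMethods.GZip | DecompressionMethods.Deflate;
        }
        return request;
    }
}
```

With this in place, the StreamReader code from the question can read the stream from CreateRequest(url, userAgent).GetResponse() unchanged, and DecompressingWebClient can stand in wherever WebClient was used.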

Old/Manual solution:

string source;
var response = req.GetResponse();

var stream = response.GetResponseStream();
try
{
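    // The header indexer is case-insensitive and returns null when the header
    // is absent, so no LINQ call on AllKeys is needed.
    string contentEncoding = response.Headers["Content-Encoding"];
    if (contentEncoding != null && contentEncoding.Contains("gzip"))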
    {
        stream = new System.IO.Compression.GZipStream(stream, System.IO.Compression.CompressionMode.Decompress);
    }

    using (StreamReader reader = new StreamReader(stream))
    {
        source = reader.ReadToEnd();
    }
}
finally
{
    if (stream != null)
        stream.Dispose();
}
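
The manual approach can also be tried offline. The sketch below assumes the response body has already been read into a byte array, and checks for the gzip magic bytes (0x1F 0x8B) instead of the Content-Encoding header. GzipBody, ReadText and Compress are hypothetical names introduced only for this example. It also hints at where the � characters come from: compressed bytes such as 0x8B are not valid UTF-8, so a StreamReader substitutes the replacement character for them.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class GzipBody
{
    // Hypothetical helper: returns the text of a body, inflating it first
    // when it starts with the gzip magic bytes 0x1F 0x8B.
    public static string ReadText(byte[] body, Encoding encoding)
    {
        bool looksGzipped = body.Length >= 2 && body[0] == 0x1F && body[1] == 0x8B;
        using (var input = new MemoryStream(body))
        {
            Stream source = looksGzipped
                ? (Stream)new GZipStream(input, CompressionMode.Decompress)
                : input;
            // Disposing the reader also disposes the wrapped stream(s).
            using (var reader = new StreamReader(source, encoding))
            {
                return reader.ReadToEnd();
            }
        }
    }

    // Compresses text with gzip; included only so the example runs offline.
    public static byte[] Compress(string text, Encoding encoding)
    {
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            {
                byte[] raw = encoding.GetBytes(text);
                gzip.Write(raw, 0, raw.Length);
            }
            // ToArray is still valid after the GZipStream closed the MemoryStream.
            return output.ToArray();
        }
    }
}
```

Passing the output of Compress to ReadText should give back the original text, while a body that is not gzip-compressed is decoded as-is.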

answered Oct 24 '22 05:10

Peter
