Downloading a website into a string using C# WebClient or HttpWebRequest

I am trying to download the contents of a website. However, for one particular webpage the returned string contains jumbled data with many � characters.

Here is the code I was originally using.

HttpWebRequest req = (HttpWebRequest)HttpWebRequest.Create(url);
req.Method = "GET";
req.UserAgent = "Mozilla/5.0 (Windows; U; MSIE 9.0; WIndows NT 9.0; en-US))";
string source;
using (StreamReader reader = new StreamReader(req.GetResponse().GetResponseStream()))
{
    source = reader.ReadToEnd();
}
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(source);

I also tried alternate implementations with WebClient, but still the same result:

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
using (WebClient client = new WebClient())
using (var read = client.OpenRead(url))
{
    doc.Load(read, true);
}

From searching, I suspect this is an encoding issue, so I tried both of the solutions linked below, but I still cannot get it to work.

  • http://blogs.msdn.com/b/feroze_daud/archive/2004/03/30/104440.aspx
  • http://bytes.com/topic/c-sharp/answers/653250-webclient-encoding

The offending page is the United_States article on the English version of Wikipedia (en.wikipedia.org/wiki/United_States). I have tried a number of other Wikipedia articles and have not seen this issue with them.

Nick Collier asked Sep 22 '11 16:09



2 Answers

Using the built-in loader in HtmlAgilityPack worked for me:

HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://en.wikipedia.org/wiki/United_States");
string html = doc.DocumentNode.OuterHtml; // no jumbled data here

Edit:

Using a standard WebClient with your user agent results in an HTTP 403 (Forbidden). Using this one instead worked for me:

using (WebClient wc = new WebClient())
{
    wc.Headers.Add("user-agent", "Mozilla/5.0 (Windows; Windows NT 5.1; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4");
    string html = wc.DownloadString("http://en.wikipedia.org/wiki/United_States");
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
}

Also see this SO thread: WebClient forbids opening wikipedia page?

BrokenGlass answered Oct 24 '22 04:10


The response is gzip-encoded. Try the following to decode the stream:

UPDATE

Based on the comment by BrokenGlass, setting the following properties should solve your problem (it worked for me):

req.Headers[HttpRequestHeader.AcceptEncoding] = "gzip, deflate";
req.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
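
Combined with the request from the question, the two properties might be applied as in the sketch below. The names DecompressingFetch, CreateRequest, Download, and DecompressingWebClient are hypothetical and chosen only for illustration. The WebClient subclass is an assumption about how the same setting could be carried over to the WebClient attempts, since WebClient does not expose AutomaticDecompression directly. On newer .NET versions WebRequest.Create and WebClient are marked obsolete in favor of HttpClient, so treat this as a sketch for the APIs used in this thread.

```csharp
using System;
using System.IO;
using System.Net;

public static class DecompressingFetch
{
    // Builds the request without sending it. With AutomaticDecompression set,
    // the framework advertises gzip/deflate and decodes the response body itself.
    public static HttpWebRequest CreateRequest(string url)
    {
        var req = (HttpWebRequest)WebRequest.Create(url);
        req.Method = "GET";
        req.UserAgent = "Mozilla/5.0 (Windows; Windows NT 5.1; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4";
        req.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
        return req;
    }

    // Sends the request and returns the decoded body as a string.
    public static string Download(string url)
    {
        HttpWebRequest req = CreateRequest(url);
        using (WebResponse response = req.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}

// Hypothetical WebClient subclass that applies the same setting to every request it creates.
public class DecompressingWebClient : WebClient
{
    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        var http = request as HttpWebRequest;
        if (http != null)
        {
            http.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
        }
        return request;
    }
}
```

Calling DecompressingFetch.Download(url) would then return the page source ready for doc.LoadHtml, and a DecompressingWebClient could stand in for the plain WebClient in the question's second snippet.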

Old/Manual solution:

string source;
// Dispose the response when done so the underlying connection is released.
using (var response = req.GetResponse())
{
    var stream = response.GetResponseStream();

    // The header indexer returns null when Content-Encoding is absent.
    string contentEncoding = response.Headers["Content-Encoding"];
    if (contentEncoding != null && contentEncoding.Contains("gzip"))
    {
        stream = new System.IO.Compression.GZipStream(stream, System.IO.Compression.CompressionMode.Decompress);
    }

    // Disposing the reader also disposes the stream it wraps.
    using (StreamReader reader = new StreamReader(stream))
    {
        source = reader.ReadToEnd();
    }
}
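
To see why the undecoded response looks jumbled, the sketch below compresses a short string in memory and reads it back twice: once as raw bytes, the way the original code read the gzip body, and once through a GZipStream. No network is involved. GzipDemo, Compress, ReadRaw, and ReadDecompressed are hypothetical names used only for this illustration.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class GzipDemo
{
    // Gzip-compresses a UTF-8 string, standing in for a gzip-encoded response body.
    public static byte[] Compress(string text)
    {
        using (var buffer = new MemoryStream())
        {
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress))
            {
                byte[] bytes = Encoding.UTF8.GetBytes(text);
                gzip.Write(bytes, 0, bytes.Length);
            }
            // The GZipStream is closed first so the gzip trailer has been written.
            return buffer.ToArray();
        }
    }

    // Decodes the compressed bytes directly as text; invalid UTF-8 sequences
    // become U+FFFD replacement characters.
    public static string ReadRaw(byte[] body)
    {
        using (var reader = new StreamReader(new MemoryStream(body)))
        {
            return reader.ReadToEnd();
        }
    }

    // Decompresses first, then decodes as text.
    public static string ReadDecompressed(byte[] body)
    {
        using (var gzip = new GZipStream(new MemoryStream(body), CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        byte[] body = Compress("<html><body>United States</body></html>");
        // Reports whether the raw read produced replacement characters.
        Console.WriteLine(ReadRaw(body).IndexOf('\uFFFD') >= 0);
        // Prints the text recovered after decompression.
        Console.WriteLine(ReadDecompressed(body));
    }
}
```

The second byte of every gzip stream is 0x8B, which is not valid UTF-8 on its own, so a plain StreamReader over the compressed body yields at least one � before it reaches any page content.
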
Peter answered Oct 24 '22 05:10