I am trying to download the contents of a website. However for a certain webpage the string returned contains jumbled data, containing many � characters.
Here is the code I was originally using.
HttpWebRequest req = (HttpWebRequest)HttpWebRequest.Create(url);
req.Method = "GET";
req.UserAgent = "Mozilla/5.0 (Windows; U; MSIE 9.0; WIndows NT 9.0; en-US))";
string source;
using (StreamReader reader = new StreamReader(req.GetResponse().GetResponseStream()))
{
source = reader.ReadToEnd();
}
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(source);
I also tried alternate implementations with WebClient, but still the same result:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
using (WebClient client = new WebClient())
using (var read = client.OpenRead(url))
{
doc.Load(read, true);
}
From searching I guess this might be an issue with Encoding, so I tried both the solutions posted below but still cannot get this to work.
The offending site that I cannot seem to download is the United_States article on the english version of WikiPedia (en . wikipedia . org / wiki / United_States). Although I have tried a number of other wikipedia articles and have not seen this issue.
In the real sense it has no meaning or full form. It was developed by Dennis Ritchie and Ken Thompson at AT&T bell Lab. First, they used to call it as B language then later they made some improvement into it and renamed it as C and its superscript as C++ which was invented by Dr.
" " C is a computer programming language. That means that you can use C to create lists of instructions for a computer to follow. C is one of thousands of programming languages currently in use.
C programming language is a machine-independent programming language that is mainly used to create many types of applications and operating systems such as Windows, and other complicated programs such as the Oracle database, Git, Python interpreter, and games and is considered a programming foundation in the process of ...
Using the built-in loader in HtmlAgilityPack worked for me:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://en.wikipedia.org/wiki/United_States");
string html = doc.DocumentNode.OuterHtml; // I don't see no jumbled data here
Edit:
Using a standard WebClient
with your user-agent will result in a HTTP 403 - forbidden - using this instead worked for me:
using (WebClient wc = new WebClient())
{
wc.Headers.Add("user-agent", "Mozilla/5.0 (Windows; Windows NT 5.1; rv:1.9.2.4) Gecko/20100611 Firefox/3.6.4");
string html = wc.DownloadString("http://en.wikipedia.org/wiki/United_States");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
}
Also see this SO thread: WebClient forbids opening wikipedia page?
The response is gzip encoded. Try the following to decode the stream:
UPDATE
Based on the comment by BrokenGlass setting the following properties should solve your problem (worked for me):
req.Headers[HttpRequestHeader.AcceptEncoding] = "gzip, deflate";
req.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
Old/Manual solution:
string source;
var response = req.GetResponse();
var stream = response.GetResponseStream();
try
{
if (response.Headers.AllKeys.Contains("Content-Encoding")
&& response.Headers["Content-Encoding"].Contains("gzip"))
{
stream = new System.IO.Compression.GZipStream(stream, System.IO.Compression.CompressionMode.Decompress);
}
using (StreamReader reader = new StreamReader(stream))
{
source = reader.ReadToEnd();
}
}
finally
{
if (stream != null)
stream.Dispose();
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With