I am writing a program to download HTML pages from other websites. For some particular websites I cannot get the full HTML; I only get partial content. The servers with this problem send their data with "Transfer-Encoding: chunked", and I suspect this is the cause of the problem.
This is the header information returned by the server:
Transfer-Encoding: chunked
Connection: keep-alive
Pragma: no-cache
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Content-Type: text/html; charset=UTF-8
Date: Sun, 11 Sep 2011 09:46:23 GMT
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Server: nginx/1.0.6
Here is my code:
HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;
HttpWebResponse response;
CookieContainer cookie = new CookieContainer();
request.CookieContainer = cookie;
request.AllowAutoRedirect = true;
request.KeepAlive = true;
request.UserAgent =
    @"Mozilla/5.0 (Windows NT 6.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2 FirePHP/0.6";
request.Accept = @"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
string html = string.Empty;
response = request.GetResponse() as HttpWebResponse;
using (StreamReader reader = new StreamReader(response.GetResponseStream()))
{
    html = reader.ReadToEnd();
}
I can only get partial HTML (I think it is the first chunk from the server). Could anyone help? Is there a solution?
Thanks!
You can't use ReadToEnd to read chunked data. You need to read directly from the response stream with Stream.Read, looping until it returns 0, and decode the bytes yourself.
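// Requires: using System.IO; using System.Text;
StringBuilder sb = new StringBuilder();
byte[] buf = new byte[8192];
int count; // bytes returned by the last Read call
// A stateful decoder carries a partial multi-byte sequence over to the next read
Decoder decoder = Encoding.UTF8.GetDecoder(); // just hardcoding UTF8 here
char[] chars = new char[Encoding.UTF8.GetMaxCharCount(buf.Length)];
using (Stream resStream = response.GetResponseStream())
{
    do
    {
        count = resStream.Read(buf, 0, buf.Length);
        if (count != 0)
        {
            int charCount = decoder.GetChars(buf, 0, count, chars, 0);
            sb.Append(chars, 0, charCount);
        }
    } while (count > 0); // Read returns 0 at the end of the stream
}
string html = sb.ToString();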
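If you want to try the read loop without hitting a real server, here is a minimal sketch. It is not part of the answer above. TrickleStream and ReadAllText are hypothetical names made up for this example, and the one-byte-per-read behaviour only imitates a body that arrives piece by piece. The sketch collects the raw bytes first and decodes them once at the end, so a multi-byte character that is split across two reads is not corrupted.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical stream that returns at most maxPerRead bytes per Read call,
// imitating a network stream that delivers the body in small pieces.
class TrickleStream : Stream
{
    private readonly byte[] _data;
    private readonly int _maxPerRead;
    private int _pos;

    public TrickleStream(byte[] data, int maxPerRead)
    {
        _data = data;
        _maxPerRead = maxPerRead;
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        int n = Math.Min(Math.Min(count, _maxPerRead), _data.Length - _pos);
        Array.Copy(_data, _pos, buffer, offset, n);
        _pos += n;
        return n; // 0 signals the end of the stream
    }

    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return false; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
}

static class ChunkedReadDemo
{
    // Hypothetical helper: drain the stream with a Read loop, then decode once.
    public static string ReadAllText(Stream stream, Encoding encoding)
    {
        using (MemoryStream collected = new MemoryStream())
        {
            byte[] buf = new byte[8192];
            int count;
            // A single Read may return fewer bytes than requested; keep going until 0.
            while ((count = stream.Read(buf, 0, buf.Length)) > 0)
            {
                collected.Write(buf, 0, count);
            }
            return encoding.GetString(collected.ToArray());
        }
    }

    static void Main()
    {
        string original = "<html><body>caf\u00e9 \u2603</body></html>";
        byte[] body = Encoding.UTF8.GetBytes(original);
        // One byte per Read, so every multi-byte character is split across reads.
        string html = ReadAllText(new TrickleStream(body, 1), Encoding.UTF8);
        Console.WriteLine(html == original);
    }
}
```

With a real response you would pass response.GetResponseStream() instead of the TrickleStream. Ideally the encoding would come from the charset in the Content-Type header rather than being hardcoded.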