Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimizing download of multiple web pages. C#

I am developing an app where I need to download a bunch of web pages, preferably as fast as possible. The way that I do that right now is that I have multiple threads (100's) that have their own System.Net.HttpWebRequest. This sort of works, but I am not getting the performance I would like. Currently I have a beefy 600+ Mb/s connection to work with, and this is only utilized at most 10% (at peaks). I guess my strategy is flawed, but I am unable to find any other good way of doing this.

Also: If the use of HttpWebRequest is not a good way to download web pages, please say so :) The code has been semi-auto-converted from java.

Thanks :)

Update:

public String getPage(String link){
   myURL = new System.Uri(link);
   myHttpConn = (System.Net.HttpWebRequest)System.Net.WebRequest.Create(myURL);
   myStreamReader = new System.IO.StreamReader(new System.IO.StreamReader(myHttpConn.GetResponse().GetResponseStream(),
            System.Text.Encoding.Default).BaseStream,
                new System.IO.StreamReader(myHttpConn.GetResponse().GetResponseStream(),
                    System.Text.Encoding.Default).CurrentEncoding);

        System.Text.StringBuilder buffer = new System.Text.StringBuilder();

        //myLineBuff is a String
        while ((myLineBuff = myStreamReader.ReadLine()) != null)
        {
            buffer.Append(myLineBuff);
        }
   return buffer.toString();
}
like image 399
Automatico Avatar asked May 19 '11 16:05

Automatico


1 Answers

One problem is that it appears you're issuing each request twice:

myStreamReader = new System.IO.StreamReader(
    new System.IO.StreamReader(
        myHttpConn.GetResponse().GetResponseStream(),
        System.Text.Encoding.Default).BaseStream,
             new System.IO.StreamReader(myHttpConn.GetResponse().GetResponseStream(),
                 System.Text.Encoding.Default).CurrentEncoding);

It makes two calls to GetResponse. For reasons I fail to understand, you're also creating two stream readers. You can split that up and simplify it, and also do a better job of error handling...

var response = (HttpWebResponse)myHttpCon.GetResponse();
myStreamReader = new StreamReader(response.GetResponseStream(), Encoding.Default)

That should double your effective throughput.

Also, you probably want to make sure to dispose of the objects you're using. When you're downloading a lot of pages, you can quickly run out of resources if you don't clean up after yourself. In this case, you should call response.Close(). See http://msdn.microsoft.com/en-us/library/system.net.httpwebresponse.close.aspx

like image 125
Jim Mischel Avatar answered Sep 25 '22 01:09

Jim Mischel