HttpWebResponse + StreamReader Very Slow

I'm trying to implement a limited web crawler in C# (for a few hundred sites only) using HttpWebRequest.GetResponse() and StreamReader.ReadToEnd(). I also tried using StreamReader.Read() in a loop to build my HTML string.

I'm only downloading pages of about 5-10 KB each.

It's all very slow! For example, the average GetResponse() time is about half a second, while the average StreamReader.ReadToEnd() time is about 5 seconds!

All the sites should be very fast: they are close to my location and have fast servers (in Internet Explorer the pages take practically no time to download), and I am not using any proxy.

My crawler has about 20 threads reading simultaneously from the same site. Could this be causing a problem?

How do I reduce StreamReader.ReadToEnd times DRASTICALLY?

asked May 23 '09 11:05 by Roey


3 Answers

HttpWebRequest may be taking a while to detect your proxy settings. Try adding this to your application config:

<system.net>
  <defaultProxy enabled="false">
    <proxy/>
    <bypasslist/>
    <module/>
  </defaultProxy>
</system.net>

You might also see a slight performance gain from buffering your reads to reduce the number of calls made to the underlying operating system socket:

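// 'stream' is assumed to be the Stream returned by response.GetResponseStream()
using (BufferedStream buffer = new BufferedStream(stream))
{
  using (StreamReader reader = new StreamReader(buffer))
  {
    // Read the whole response body through the buffer
    pageContent = reader.ReadToEnd();
  }
}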
answered Nov 08 '22 23:11 by kgriffs


WebClient's DownloadString is a simple wrapper around HttpWebRequest. Could you try using that temporarily and see if the speed improves? If things get much faster, could you share your code so we can have a look at what may be wrong with it?
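A minimal sketch of that experiment, for comparison against the HttpWebRequest timings. The URL is a placeholder and the TimedDownload helper is a hypothetical name; neither comes from the question.

using System;
using System.Diagnostics;
using System.Net;

static class DownloadTimer
{
    // Downloads one page with WebClient and reports how long it took.
    // (Hypothetical helper, not part of the original question.)
    public static string TimedDownload(string url, out TimeSpan elapsed)
    {
        Stopwatch watch = Stopwatch.StartNew();
        using (WebClient client = new WebClient())
        {
            string html = client.DownloadString(url);
            elapsed = watch.Elapsed;
            return html;
        }
    }

    static void Main()
    {
        TimeSpan elapsed;
        // Placeholder URL; substitute one of the pages being crawled.
        string html = TimedDownload("http://example.com/", out elapsed);
        Console.WriteLine("{0} chars in {1} ms", html.Length, elapsed.TotalMilliseconds);
    }
}

If this is dramatically faster than the GetResponse() plus ReadToEnd() path for the same page, the problem is more likely in the crawler code than in the network.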

EDIT:

It seems HttpWebRequest enforces a per-host limit on concurrent connections, much like IE's 'max concurrent connections' setting. Are these URLs on the same domain? You could try increasing the connection limit to see if that helps. I found this article about the problem:

"By default, you can't perform more than 2-3 async HttpWebRequests (depends on the OS). In order to override it (the easiest way, IMHO) don't forget to add this under the <configuration> section in the application's config file:"

<system.net>
  <connectionManagement>
     <add address="*" maxconnection="65000" />
  </connectionManagement>
</system.net>
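The limit can also be raised in code through ServicePointManager.DefaultConnectionLimit, which may be handy when editing the config file is not an option. This is only a sketch: the RaiseLimit helper is a hypothetical name, and the value 20 is an assumption chosen to match the thread count mentioned in the question.

using System;
using System.Net;

static class ConnectionLimitDemo
{
    // Raises the default per-host connection limit. The new value applies
    // to ServicePoint objects created after this call, so call it early.
    // (Hypothetical helper, not part of the original answer.)
    public static void RaiseLimit(int maxConnections)
    {
        ServicePointManager.DefaultConnectionLimit = maxConnections;
    }

    static void Main()
    {
        Console.WriteLine("Before: {0}", ServicePointManager.DefaultConnectionLimit);
        RaiseLimit(20); // assumed value: one connection per crawler thread
        Console.WriteLine("After: {0}", ServicePointManager.DefaultConnectionLimit);
    }
}

Call this once at startup, before the first request is created, so that every crawler thread picks up the higher limit.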
answered Nov 08 '22 23:11 by Matt Brindley


I had the same problem, but setting the HttpWebRequest's Proxy property to null solved it.

UriBuilder ub = new UriBuilder(url);
HttpWebRequest request = (HttpWebRequest)WebRequest.Create( ub.Uri );
request.Proxy = null;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
answered Nov 08 '22 21:11 by bisand