 

Concurrency Limit on HttpWebRequest

I am writing an application to measure how fast I can download web pages using C#. I supply a list of unique domain names, then spawn X threads that issue HttpWebRequests until the list of domains has been consumed. The problem is that no matter how many threads I use, I only get about 3 pages per second.

I discovered that System.Net.ServicePointManager.DefaultConnectionLimit is 2, but I was under the impression that this limit applies to the number of connections per domain. Since each domain in the list is unique, this should not be an issue.
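(For reference, a minimal sketch of raising that limit before any requests are issued; the value 100 is only an illustrative assumption, not a recommendation:)

  using System.Net;

  static class ConnectionLimitSetup
  {
    public static void Configure()
    {
      // The default for desktop apps is 2 connections per host; raising it
      // lets more requests to the same host be in flight at once.
      ServicePointManager.DefaultConnectionLimit = 100;
    }
  }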

Then I found a claim that the GetResponse() method blocks all other requests until the WebResponse is closed: http://www.codeproject.com/KB/IP/Crawler.aspx#WebRequest. I have not found any other information on the web to back this claim up. However, I implemented an HTTP request directly over sockets, and I noticed a significant speed-up (4x to 6x).
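(A rough sketch of the kind of socket-based request I mean; the HTTP/1.0 request format and port 80 are illustrative assumptions, not exactly what my test code does:)

  using System.IO;
  using System.Net.Sockets;
  using System.Text;

  static class RawHttpGet
  {
    public static string Fetch(string host)
    {
      using (var client = new TcpClient(host, 80))
      using (var stream = client.GetStream())
      {
        // HTTP/1.0 with Connection: close, so the server ends the stream
        // when the response is complete.
        var request = Encoding.ASCII.GetBytes(
          "GET / HTTP/1.0\r\nHost: " + host + "\r\nConnection: close\r\n\r\n");
        stream.Write(request, 0, request.Length);

        using (var reader = new StreamReader(stream, Encoding.ASCII))
        {
          return reader.ReadToEnd(); // headers + body
        }
      }
    }
  }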

So my questions: does anyone know exactly how HttpWebRequest objects work? Is there a workaround besides the one mentioned above? Are there any examples of high-speed web crawlers written in C# anywhere?

asked by Kam Sheffield

2 Answers

Have you tried using the async methods, such as BeginGetResponse()?

If you're using .NET 4.0, you may want to try the code below. Essentially, I use Tasks to make 1000 requests to a specific site. I use this to load-test an app on my dev machine, and I see no such limit, since my app receives these requests in rapid succession.

  using System;
  using System.IO;
  using System.Net;
  using System.Threading.Tasks;
  using System.Windows.Forms;

  public partial class Form1 : Form
  {
    public Form1()
    {
      InitializeComponent();
    }

    private void button1_Click(object sender, EventArgs e)
    {
      // Fire off 1000 requests without blocking the UI thread; each
      // continuation runs when its response arrives.
      for (int i = 0; i < 1000; i++)
      {
        var webRequest = WebRequest.Create(textBox1.Text);
        webRequest.GetResponseAsync().ContinueWith(t =>
        {
          if (t.Exception == null)
          {
            using (var response = t.Result)
            using (var sr = new StreamReader(response.GetResponseStream()))
            {
              string str = sr.ReadToEnd();
            }
          }
          else
            System.Diagnostics.Debug.WriteLine(t.Exception.InnerException.Message);
        });
      }
    }
  }

  public static class WebRequestExtensions
  {
    // Wraps the APM pair BeginGetResponse/EndGetResponse in a Task so it
    // can be used with ContinueWith on .NET 4.0.
    public static Task<WebResponse> GetResponseAsync(this WebRequest request)
    {
      return Task.Factory.FromAsync<WebResponse>(request.BeginGetResponse, request.EndGetResponse, null);
    }
  }

Since the workload here is I/O bound, spawning threads to get the job done is not required and could in fact hurt performance. The async methods on the WebClient class use I/O completion ports, so they are much more performant and less resource-hungry.
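A minimal sketch of the event-based WebClient pattern I'm referring to (the URI is a placeholder):

  using System;
  using System.Net;

  static class WebClientAsyncExample
  {
    public static void Download(Uri uri)
    {
      var client = new WebClient();
      client.DownloadStringCompleted += (sender, e) =>
      {
        if (e.Error == null)
          Console.WriteLine("Downloaded {0} characters", e.Result.Length);
        else
          Console.WriteLine(e.Error.Message);
      };
      // Returns immediately; the completion handler runs when the download
      // finishes, without tying up a thread while waiting.
      client.DownloadStringAsync(uri);
    }
  }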

answered by Shiv Kumar


You should be using the BeginGetResponse method, which doesn't block and is asynchronous.

When dealing with I/O-bound work, spawning a thread to do the I/O doesn't help: that thread will still block waiting for the hardware (in this case the network card) to respond. If you use the built-in BeginGetResponse, the request is simply queued up with the network card and the thread is free to do more work. When the hardware is done, it notifies you, at which point your callback is called.
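A minimal sketch of that callback pattern with BeginGetResponse/EndGetResponse (the url parameter is a placeholder):

  using System;
  using System.IO;
  using System.Net;

  static class BeginGetResponseExample
  {
    public static void StartRequest(string url)
    {
      var request = (HttpWebRequest)WebRequest.Create(url);
      // BeginGetResponse returns immediately; the callback fires on a
      // thread-pool thread once the response arrives.
      request.BeginGetResponse(ar =>
      {
        var req = (HttpWebRequest)ar.AsyncState;
        using (var response = req.EndGetResponse(ar))
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
          string body = reader.ReadToEnd();
          Console.WriteLine("{0}: {1} characters", req.RequestUri, body.Length);
        }
      }, request);
    }
  }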

answered by BFree