Concurrent web request performance issues

I am working on a new service to run QA for our company's multiple web properties and have run into an interesting network concurrency issue. To increase performance, I am using the TPL to create HttpWebRequests based on a large collection of URLs so that they can run in parallel; however, I can't seem to find where the bottleneck is in the process.

My observations so far:

  • I can get a max of about 25-30 parallel threads via the TPL
  • The CPU never breaks 5-6% for the service (running on 1-4 cores, with and without H/T)
  • NIC usage never breaks 2-3%
  • Overall network traffic doesn't seem to be affected (other users don't complain, and speed tests run at the same time don't show much of an effect)
  • Speed does not change much between running on our office network (15Mbps) and our data center (100+Mbps)
  • I get a bit of a performance gain by downloading from multiple hosts at once rather than a lot of pages from one host.

Possible pain points:

  • CPU (number of cores or hardware threads)
  • NIC
  • Max allowed number of concurrent HttpWebRequests
  • LAN
  • WAN
  • Router/Switch/Load balancer

So the question is:

Obviously there is no way to download the entire internet in a matter of minutes, but I am interested to know where the bottleneck is in a scenario like this and what, if anything, can be done to overcome it.

As a side note, we are currently using a 3rd party service for crawling, but we are limited by them in some ways and would like more flexibility. Something about corporate secret sauce or poison on the tip of the arrow ... :)

asked Jun 19 '12 by Steve Konves

3 Answers

I strongly suspect one of the following is the cause:

  1. You are running into the default connection limit. Check the value of ServicePointManager.DefaultConnectionLimit; I recommend setting it to a practically infinite value such as 1000 (see the snippet after this list).
  2. The TPL is not starting as many threads as are necessary to saturate the network. Note that remote web servers can have significant latency; while your thread waits, it puts no load on the network.
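For point 1, a minimal sketch of raising the limit (the value 1000 comes from the recommendation above); this has to run before the first request is created:

    using System.Net;

    // The default is 2 connections per host, which throttles parallel
    // downloads. Raise it before any HttpWebRequest is created.
    ServicePointManager.DefaultConnectionLimit = 1000;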

The TPL does not guarantee you any minimum degree of parallelism (DOP). That is a pity, because when working with IO you sometimes really need to control the degree of parallelism exactly.

I recommend manually starting a fixed number of threads to do your IO, because that is the only way to guarantee a specific DOP. You will need to experiment with the exact value; it could be anywhere from 50 to 500. You can reduce the default stack size of your threads to save memory with that many of them.
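A minimal sketch of that approach, assuming the URLs arrive through a BlockingCollection; the thread count and stack size are illustrative values you would tune:

    using System.Collections.Concurrent;
    using System.IO;
    using System.Net;
    using System.Threading;

    const int DegreeOfParallelism = 100;  // experiment: 50-500
    const int StackSize = 256 * 1024;     // smaller stack to save memory

    var urls = new BlockingCollection<string>();
    // ... urls.Add(...) from your crawl list, then urls.CompleteAdding();

    for (int i = 0; i < DegreeOfParallelism; i++)
    {
        var worker = new Thread(() =>
        {
            // Each thread pulls URLs until the collection is drained.
            foreach (var url in urls.GetConsumingEnumerable())
            {
                var request = (HttpWebRequest)WebRequest.Create(url);
                using (var response = request.GetResponse())
                using (var reader = new StreamReader(response.GetResponseStream()))
                {
                    var body = reader.ReadToEnd();
                    // ... process body ...
                }
            }
        }, StackSize);
        worker.IsBackground = true;
        worker.Start();
    }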

answered Oct 31 '22 by usr


Maybe you're hitting the TCP connection limit, or not disposing of connections properly. In any case, try using something like JMeter to see the maximum concurrent HTTP throughput you can get.
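On the disposal point, a minimal sketch of the pattern to check for, assuming url holds the target address; a response that is never disposed keeps its connection out of the pool until finalization:

    using System.IO;
    using System.Net;

    var request = (HttpWebRequest)WebRequest.Create(url);
    using (var response = request.GetResponse())
    using (var stream = response.GetResponseStream())
    using (var reader = new StreamReader(stream))
    {
        var body = reader.ReadToEnd();
        // ... process body ...
    }  // connection is returned to the pool here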

answered Oct 31 '22 by Ilya Kozhevnikov


The code is really very simple. I use Parallel.ForEach to loop through a collection of URLs (strings). The action creates an HttpWebRequest and then dumps the results into a ConcurrentBag. BTW, NCrawler seems interesting; I'll check it out. Thanks for the tip.
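For context, a rough sketch of the loop as described, assuming urls is the collection of URL strings (names are illustrative):

    using System.Collections.Concurrent;
    using System.IO;
    using System.Net;
    using System.Threading.Tasks;

    var results = new ConcurrentBag<string>();

    Parallel.ForEach(urls, url =>
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        using (var response = request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            results.Add(reader.ReadToEnd());
        }
    });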

Because Parallel.ForEach gives you no real control over how many threads it uses, I suggest at least switching to the ThreadPool.

You can use QueueUserWorkItem to queue work until your entire task collection has been pushed to worker threads, or until the method returns false (no more threads in the pool).

With the ThreadPool, you can control the maximum number of threads to be allocated via SetMaxThreads.
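A minimal sketch of that approach, again assuming urls is the collection of URL strings (the thread counts are illustrative):

    using System.Net;
    using System.Threading;

    // Cap worker threads and IO completion threads (100 is illustrative).
    ThreadPool.SetMaxThreads(100, 100);

    foreach (var url in urls)
    {
        bool queued = ThreadPool.QueueUserWorkItem(state =>
        {
            var request = (HttpWebRequest)WebRequest.Create((string)state);
            using (var response = request.GetResponse())
            {
                // ... read and process the response ...
            }
        }, url);

        if (!queued)
            break;  // no more threads available in the pool
    }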

answered Oct 31 '22 by Marcel N.