Concurrent web request performance issues

I am working on a new service to run QA for our company's multiple web properties and have run into an interesting network concurrency issue. To increase performance, I am using the TPL to create HttpWebRequests based on a large collection of URLs so that they can run in parallel; however, I can't seem to find where the bottleneck is in the process.

My observations so far:

  • I can get a max of about 25-30 parallel threads via the TPL
  • The CPU never breaks 5-6% for the service (running on 1-4 cores, with and without H/T)
  • NIC usage never breaks 2-3%
  • Overall network traffic doesn't seem to be affected (other users don't complain, and speed tests run at the same time don't show much of an effect)
  • Speed does not change much between running on our office network (15Mbps) and our data center (100+Mbps)
  • I get a bit of a performance gain by downloading from multiple hosts at once rather than a lot of pages from one host.

Possible pain points:

  • CPU (number of cores or hardware threads)
  • NIC
  • Max allowed number of concurrent HttpWebRequests
  • LAN
  • WAN
  • Router/Switch/Load balancer

So the question is:

Obviously there is no way to download the entire internet in a matter of minutes, but I am interested to know where the bottleneck is in a scenario like this and what, if anything, can be done to overcome it.

As a side note, we are currently using a 3rd party service for crawling, but we are limited by them in some ways and would like more flexibility. Something about corporate secret sauce or poison on the tip of the arrow ... :)

asked Jun 19 '12 by Steve Konves

3 Answers

I strongly suspect one of the following is the cause:

  1. You are running into the default connection limit. Check the value of ServicePointManager.DefaultConnectionLimit; I recommend setting it to a practically infinite value such as 1000 (see the snippet after this list).
  2. The TPL is not starting as many threads as are necessary to saturate the network. Note that remote web servers can have significant latency; while your thread waits, it puts no load on the network.
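For point 1, a minimal sketch of raising the limit (the value 1000 comes from the recommendation above); this has to run before the first request is created:

    using System.Net;

    // The default is 2 connections per host, which throttles parallel
    // downloads. Raise it before any HttpWebRequest is created.
    ServicePointManager.DefaultConnectionLimit = 1000;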

The TPL does not guarantee you any minimum degree of parallelism (DOP). That is a pity, because when working with IO you sometimes really need to control the degree of parallelism exactly.

I recommend manually starting a fixed number of threads to do your IO, because that is the only way to guarantee a specific DOP. You will need to experiment with the exact value; it could be anywhere from 50 to 500. You can reduce the default stack size of your threads to save memory with that many of them.
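A minimal sketch of that approach, assuming the URLs arrive through a BlockingCollection; the thread count and stack size are illustrative values you would tune:

    using System.Collections.Concurrent;
    using System.IO;
    using System.Net;
    using System.Threading;

    const int DegreeOfParallelism = 100;  // experiment: 50-500
    const int StackSize = 256 * 1024;     // smaller stack to save memory

    var urls = new BlockingCollection<string>();
    // ... urls.Add(...) from your crawl list, then urls.CompleteAdding();

    for (int i = 0; i < DegreeOfParallelism; i++)
    {
        var worker = new Thread(() =>
        {
            // Each thread pulls URLs until the collection is drained.
            foreach (var url in urls.GetConsumingEnumerable())
            {
                var request = (HttpWebRequest)WebRequest.Create(url);
                using (var response = request.GetResponse())
                using (var reader = new StreamReader(response.GetResponseStream()))
                {
                    var body = reader.ReadToEnd();
                    // ... process body ...
                }
            }
        }, StackSize);
        worker.IsBackground = true;
        worker.Start();
    }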

answered Oct 31 '22 by usr


Maybe you're hitting the TCP connection limit, or not disposing of connections properly. In any case, try using something like JMeter to see the maximum concurrent HTTP throughput you can get.
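On the disposal point, a minimal sketch of the pattern to check for, assuming url holds the target address; a response that is never disposed keeps its connection out of the pool until finalization:

    using System.IO;
    using System.Net;

    var request = (HttpWebRequest)WebRequest.Create(url);
    using (var response = request.GetResponse())
    using (var stream = response.GetResponseStream())
    using (var reader = new StreamReader(stream))
    {
        var body = reader.ReadToEnd();
        // ... process body ...
    }  // connection is returned to the pool here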

answered Oct 31 '22 by Ilya Kozhevnikov


The code is really very simple. I use Parallel.ForEach to loop through a collection of URLs (strings). The action creates an HttpWebRequest and then dumps the results into a ConcurrentBag. BTW, NCrawler seems interesting; I'll check it out. Thanks for the tip.
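For context, a rough sketch of the loop as described, assuming urls is the collection of URL strings (names are illustrative):

    using System.Collections.Concurrent;
    using System.IO;
    using System.Net;
    using System.Threading.Tasks;

    var results = new ConcurrentBag<string>();

    Parallel.ForEach(urls, url =>
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        using (var response = request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            results.Add(reader.ReadToEnd());
        }
    });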

Because Parallel.ForEach gives you no real control over how many threads it uses, I suggest at least switching to the ThreadPool.

You can use QueueUserWorkItem to queue work until your entire task collection has been pushed to worker threads, or until the method returns false (no more threads in the pool).

With the ThreadPool, you can control the maximum number of threads to be allocated via SetMaxThreads.
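A minimal sketch of that approach, again assuming urls is the collection of URL strings (the thread counts are illustrative):

    using System.Net;
    using System.Threading;

    // Cap worker threads and IO completion threads (100 is illustrative).
    ThreadPool.SetMaxThreads(100, 100);

    foreach (var url in urls)
    {
        bool queued = ThreadPool.QueueUserWorkItem(state =>
        {
            var request = (HttpWebRequest)WebRequest.Create((string)state);
            using (var response = request.GetResponse())
            {
                // ... read and process the response ...
            }
        }, url);

        if (!queued)
            break;  // no more threads available in the pool
    }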

answered Oct 31 '22 by Marcel N.