Best use of Parallel.ForEach / multithreading

I need to scrape data from a website. I have over 1,000 links to access. Previously I divided the links 10 per thread and started 100 threads, each pulling 10. After a few test runs, 100 threads turned out to be the best count for minimizing the time needed to retrieve the content of all the links.

I realize that .NET 4.0 offers better support for multithreading out of the box, but it bases the thread count on how many cores you have, which in my case does not spawn enough threads. I guess what I am asking is: what is the best way to optimize pulling the 1,000 links? Should I use Parallel.ForEach and let the Parallel extensions control the number of threads that get spawned, or find a way to tell it how many threads to start and divide the work myself?

I have not worked with Parallel before, so my approach may be wrong.

Zoinky asked Dec 15 '22 14:12



2 Answers

You can use the MaxDegreeOfParallelism property of ParallelOptions with Parallel.ForEach to limit how many operations run concurrently, and therefore how many threads the loop uses.

Here's the code snippet:

// Cap the loop at 5 concurrent operations.
ParallelOptions opt = new ParallelOptions();
opt.MaxDegreeOfParallelism = 5;

// MyMethod is invoked once for each directory path.
Parallel.ForEach(Directory.GetDirectories(Constants.RootFolder), opt, MyMethod);
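
Applied to the question's scenario, the same option could cap a link-pulling loop. This is only a minimal sketch: LinkPuller, PullAll and Fetch are hypothetical names, and Fetch is a placeholder that would need to be replaced with a real download call.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

static class LinkPuller
{
    // Hypothetical stand-in for the real download (assumption, not a real HTTP call).
    static string Fetch(string url) => "content of " + url;

    // Fetches every link, running at most maxParallel fetches at the same time.
    public static IDictionary<string, string> PullAll(IEnumerable<string> links, int maxParallel)
    {
        // ConcurrentDictionary is safe to write to from several threads at once.
        var results = new ConcurrentDictionary<string, string>();
        var opt = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };

        Parallel.ForEach(links, opt, url =>
        {
            results[url] = Fetch(url);
        });

        return results;
    }
}
```

Calling LinkPuller.PullAll(links, 100) would allow up to 100 concurrent fetches. The value is an upper limit, so the loop may still run with fewer.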
whihathac answered Dec 29 '22 09:12



In general, Parallel.ForEach() is quite good at optimizing the number of threads. It accounts for the number of cores in the system, but also takes into account what the threads are doing (CPU bound, IO bound, how long the method runs, etc.).

You can control the maximum degree of parallelism (MaxDegreeOfParallelism), but Parallel.ForEach has no option to force it to use more threads.
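
One way to see that the setting is a ceiling rather than a target is to record how many loop bodies actually overlap. The following is a rough sketch under stated assumptions: ConcurrencyProbe and PeakConcurrency are hypothetical names, and Thread.Sleep stands in for an I/O wait.

```csharp
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class ConcurrencyProbe
{
    // Returns the highest number of loop bodies observed running at the same time.
    public static int PeakConcurrency(int itemCount, int maxParallel)
    {
        int current = 0, peak = 0;
        var opt = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };

        Parallel.ForEach(Enumerable.Range(0, itemCount), opt, _ =>
        {
            int now = Interlocked.Increment(ref current);

            // Raise the recorded peak with a compare-and-swap retry loop.
            int seen;
            while (now > (seen = Volatile.Read(ref peak)) &&
                   Interlocked.CompareExchange(ref peak, now, seen) != seen) { }

            Thread.Sleep(20); // simulated I/O wait (assumption)
            Interlocked.Decrement(ref current);
        });

        return peak;
    }
}
```

The returned peak should never exceed maxParallel, but it can be lower, depending on how many threads the runtime decides to use.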

Make sure your benchmarks are correct and can be compared in a fair manner (e.g. same websites, allow for a warm-up period before you start measuring, and do many runs since response time variance can be quite high scraping websites). If after careful measurement your own threading code is still faster, you can conclude that you have optimized for your particular case better than .NET and stick with your own code.
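
The measurement advice above could be turned into a small timing helper along these lines. This is a sketch, not a full benchmarking harness: Bench and MedianMilliseconds are hypothetical names, and reporting the median is my assumption for dampening the high variance of network response times.

```csharp
using System;
using System.Diagnostics;

static class Bench
{
    // Runs 'action' untimed 'warmups' times, then timed 'runs' times,
    // and returns the median elapsed time in milliseconds.
    public static double MedianMilliseconds(Action action, int warmups, int runs)
    {
        if (runs < 1) throw new ArgumentOutOfRangeException(nameof(runs));

        // Warm-up passes are executed but not measured.
        for (int i = 0; i < warmups; i++) action();

        var samples = new double[runs];
        var sw = new Stopwatch();
        for (int i = 0; i < runs; i++)
        {
            sw.Restart();
            action();
            sw.Stop();
            samples[i] = sw.Elapsed.TotalMilliseconds;
        }

        // Median: middle sample, or the mean of the two middle samples.
        Array.Sort(samples);
        int mid = runs / 2;
        return runs % 2 == 1 ? samples[mid] : (samples[mid - 1] + samples[mid]) / 2.0;
    }
}
```

Each approach (the hand-rolled threads and the Parallel.ForEach version) would be passed in as the action with the same list of links, and the resulting medians compared.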

Eric J. answered Dec 29 '22 10:12
