I am using the HTML Agility Pack to parse individual pages of a forum website. The parsing method takes a page URL as an argument and returns all the topic/thread links on that page. I gather the topic links from all the parsed pages into a single collection.
After that, I check whether they are in my Dictionary of already-viewed URLs, and if they are not, I add them to a new list and the UI shows this list, which is basically the new topics/threads created since last time.
Since all these operations seem independent, what would be the best way to parallelize this?
Should I use .NET 4.0's Parallel.For/ForEach?
Either way, how can I gather the results of each page in a single collection? Or is this not necessary?
Can I read from my centralized Dictionary whenever a parse method finishes to see if they are there, simultaneously?
If I run this program for 4000 pages, it takes about 90 minutes; it would be great if I could use all 8 of my cores to finish the same task in ~10 minutes.
Parallel.For/ForEach combined with a ConcurrentDictionary<TKey, TValue> to share state between the different threads seems like a good way to implement this. The concurrent dictionary ensures safe reads and writes from multiple threads.
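For example, a minimal sketch of that approach might look like the following (the class and method names, the XPath query, and the way pages are loaded are placeholders; the real selector for topic links depends on the forum's markup):

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using HtmlAgilityPack;

class ForumScanner
{
    // Links already reported to the user (in the real program this would be
    // loaded from wherever the already-viewed URLs are persisted).
    private readonly ConcurrentDictionary<string, bool> seenUrls =
        new ConcurrentDictionary<string, bool>();

    // Parses one forum page and returns the topic/thread links it contains.
    private IEnumerable<string> ParsePage(string pageUrl)
    {
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load(pageUrl);
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) yield break;
        foreach (var a in anchors)
            yield return a.GetAttributeValue("href", string.Empty);
    }

    public IList<string> FindNewTopics(IEnumerable<string> pageUrls)
    {
        var newTopics = new ConcurrentBag<string>();

        // Each page is fetched and parsed independently on the thread pool.
        Parallel.ForEach(pageUrls, pageUrl =>
        {
            foreach (string link in ParsePage(pageUrl))
            {
                // TryAdd returns false if another thread (or a previous run)
                // already recorded this link, so each topic is reported once.
                if (seenUrls.TryAdd(link, true))
                    newTopics.Add(link);
            }
        });

        return newTopics.ToList();
    }
}
```

Because TryAdd both records a link and reports whether it was new in one atomic call, no extra locking is needed around the check-then-add step.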
After that, I check whether they are in my Dictionary of already-viewed URLs, and if they are not, I add them to a new list and the UI shows this list, which is basically the new topics/threads created since last time. Since all these operations seem independent, what would be the best way to parallelize this?
You can certainly use Parallel.For/ForEach to do that, but you should think about the design of your crawler a bit. Most crawlers dedicate several threads to crawling, and each thread is associated with a page-fetching client that is responsible for downloading the pages (in your case, probably using WebRequest/WebResponse). I would recommend reading papers on crawler design, in particular the one describing the Mercator crawler.
If you implement the Mercator design, then you should easily be able to download 50 pages per second, so your 4000 pages will be downloaded in 80 seconds.
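As a rough illustration of the "several dedicated fetch threads, each with its own page-fetching client" idea (this is not the Mercator implementation itself; FetchPool, Start, and handlePage are made-up names), something along these lines could serve as a starting point:

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Net;
using System.Threading.Tasks;

class FetchPool
{
    // Pages waiting to be downloaded; producers call Enqueue, then CompleteAdding.
    private readonly BlockingCollection<string> urls = new BlockingCollection<string>();

    public void Enqueue(string url) { urls.Add(url); }
    public void CompleteAdding() { urls.CompleteAdding(); }

    // Starts 'workerCount' long-running fetch loops. Each loop uses its own
    // WebRequest/WebResponse per page and hands the HTML to 'handlePage'.
    public Task[] Start(int workerCount, Action<string, string> handlePage)
    {
        var workers = new Task[workerCount];
        for (int i = 0; i < workerCount; i++)
        {
            workers[i] = Task.Factory.StartNew(() =>
            {
                foreach (string url in urls.GetConsumingEnumerable())
                {
                    var request = (HttpWebRequest)WebRequest.Create(url);
                    using (var response = (HttpWebResponse)request.GetResponse())
                    using (var reader = new StreamReader(response.GetResponseStream()))
                    {
                        handlePage(url, reader.ReadToEnd());
                    }
                }
            }, TaskCreationOptions.LongRunning);
        }
        return workers;
    }
}
```

The point of the dedicated workers is that the bottleneck is network I/O, not CPU, so throughput comes from keeping several downloads in flight at once rather than from spreading parsing across cores.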
Either way, how can I gather the results of each page in a single collection?
You can store your results in a ConcurrentDictionary<TKey, TValue>, like Darin mentioned. You don't need to store anything in the value, since your key would be the link/URL; however, if you're performing a URL-seen test, then you can hash each link/URL into an integer, store the hash as the key, and store the link/URL as the value.
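A small sketch of such a URL-seen test, assuming a ConcurrentDictionary keyed by the hash (UrlSeenTest and MarkIfNew are hypothetical names, and String.GetHashCode is only a stand-in for a proper fingerprint such as a 64-bit hash, since a 32-bit hash can collide):

```csharp
using System.Collections.Concurrent;

static class UrlSeenTest
{
    // Hash of the URL -> the URL itself. A numeric key keeps the dictionary
    // compact when URLs are long.
    private static readonly ConcurrentDictionary<int, string> seen =
        new ConcurrentDictionary<int, string>();

    // Returns true the first time a URL is offered, false if it was seen before.
    public static bool MarkIfNew(string url)
    {
        int key = url.GetHashCode();   // stand-in for a proper URL fingerprint
        return seen.TryAdd(key, url);
    }
}
```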
Or is this not necessary?
It's entirely up to you to decide what's necessary, but if you're performing a URL-seen test, then it is necessary.
Can I read from my centralized Dictionary whenever a parse method finishes to see if they are there, simultaneously?
Yes, ConcurrentDictionary allows multiple threads to read simultaneously, so it will work fine if you just want to check whether a link has already been crawled.
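For instance, assuming the shared dictionary is a ConcurrentDictionary<string, bool> (SeenCheckExample is a made-up name), the difference between a pure read and an atomic check-and-record looks like this:

```csharp
using System.Collections.Concurrent;

class SeenCheckExample
{
    private readonly ConcurrentDictionary<string, bool> seen =
        new ConcurrentDictionary<string, bool>();

    // Safe to call from many parse threads at once; a pure read takes no lock.
    public bool AlreadyCrawled(string url)
    {
        return seen.ContainsKey(url);
    }

    // Check and record in one atomic step: two threads racing on the same
    // link cannot both see it as "new", unlike ContainsKey followed by an Add.
    public bool MarkIfNew(string url)
    {
        return seen.TryAdd(url, true);
    }
}
```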
If I run this program for 4000 pages, it takes about 90 minutes; it would be great if I could use all 8 of my cores to finish the same task in ~10 minutes.
If you design your crawler well enough, you should be able to download and parse (i.e., extract all the links from) 4000 pages in about 57 seconds on an average desktop PC... I get roughly those results with the standard C# WebRequest on a 4 GB RAM, i5 3.2 GHz PC with a 10 Mbps connection.