
How to best parallelize parsing of webpages?

I am using the HTML Agility Pack to parse individual pages of a forum website. The parsing method takes a page link as an argument and returns all the topic/thread links found on that page. I gather the topic links from all the parsed pages into a single collection.

After that, I check whether they are in my Dictionary of already-viewed URLs; if they are not, I add them to a new list and the UI shows this list, which is basically the new topics/threads created since last time.

Since all these operations seem independent, what would be the best way to parallelize this?

Should I use .NET 4.0's Parallel.For/ForEach?

Either way, how can I gather the results of each page in a single collection? Or is this not necessary?

Can I read from my centralized Dictionary simultaneously, whenever a parse method finishes, to see if the links are already there?

If I run this program for 4,000 pages, it takes about 90 minutes; it would be great if I could use all 8 of my cores to finish the same task in roughly 10 minutes.

Asked Oct 11 '11 by Joan Venge


2 Answers

Parallel.For/ForEach combined with a ConcurrentDictionary<TKey, TValue> to share state between threads seems like a good way to implement this. The concurrent dictionary ensures safe reads and writes from multiple threads.
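A minimal sketch of what that could look like (the parseTopicLinks delegate and the collection names are placeholders, not anything from the question):

    // Sketch only: parseTopicLinks stands in for whatever HTML Agility Pack
    // code extracts the topic/thread links from one forum page.
    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Threading.Tasks;

    class CrawlSketch
    {
        // URLs seen so far, shared by all threads; the bool value is unused.
        static readonly ConcurrentDictionary<string, bool> seenUrls =
            new ConcurrentDictionary<string, bool>();

        static IEnumerable<string> CollectNewTopics(
            IEnumerable<string> pageUrls,
            Func<string, IEnumerable<string>> parseTopicLinks)
        {
            var newTopics = new ConcurrentBag<string>();

            Parallel.ForEach(pageUrls, pageUrl =>
            {
                foreach (var link in parseTopicLinks(pageUrl))
                {
                    // TryAdd is atomic: only the first thread to see a link
                    // gets true back, so each new topic is reported once.
                    if (seenUrls.TryAdd(link, true))
                        newTopics.Add(link);
                }
            });

            return newTopics; // hand this to the UI on the UI thread
        }
    }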

Answered by Darin Dimitrov


After that, I check whether they are in my Dictionary of already-viewed URLs; if they are not, I add them to a new list and the UI shows this list, which is basically the new topics/threads created since last time. Since all these operations seem independent, what would be the best way to parallelize this?

You can certainly use Parallel.For/ForEach for that, but you should think about the design of your crawler a bit. Most crawlers dedicate several threads to crawling, and each thread is associated with a page-fetching client that is responsible for fetching the pages (in your case, probably using WebRequest/WebResponse); a rough sketch of that shape appears below the paper list. I would recommend reading these papers:

  • Mercator: A scalable, extensible Web crawler (an 11 page paper, should be a pretty light read).
  • IRLbot: Scaling to 6 Billion Pages and Beyond (a 10 page paper that describes a crawler that crawls at about 600 pages per second on a 150 Mbit connection).
  • IRLbot: Scaling to 6 billion pages and beyond: full paper

If you implement the Mercator design, then you should easily be able to download 50 pages per second, so your 4,000 pages will be downloaded in about 80 seconds.
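As a very rough illustration of that queue-plus-dedicated-fetch-threads shape (this is only the fetching side, not Mercator itself, and the names pendingUrls, fetchedPages and FetchLoop are placeholders):

    // Rough sketch: a fixed pool of fetch threads pulls URLs from a shared
    // queue and pushes raw HTML to a second queue for the parser(s).
    using System;
    using System.Collections.Concurrent;
    using System.Net;
    using System.Threading;

    class FetcherSketch
    {
        static readonly BlockingCollection<string> pendingUrls =
            new BlockingCollection<string>();
        static readonly BlockingCollection<string> fetchedPages =
            new BlockingCollection<string>();

        static void StartFetchers(int threadCount)
        {
            for (int i = 0; i < threadCount; i++)
            {
                var t = new Thread(FetchLoop) { IsBackground = true };
                t.Start();
            }
        }

        // Each fetch thread owns its own WebClient and blocks on the queue,
        // so the number of simultaneous downloads equals the thread count.
        static void FetchLoop()
        {
            using (var client = new WebClient())
            {
                foreach (var url in pendingUrls.GetConsumingEnumerable())
                {
                    try
                    {
                        fetchedPages.Add(client.DownloadString(url));
                    }
                    catch (WebException)
                    {
                        // log and skip pages that fail to download
                    }
                }
            }
        }
    }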

Either way, how can I gather the results of each page in a single collection?

You can store your results in a ConcurrentDictionary<TKey, TValue>, as Darin mentioned. You don't need to store anything in the value, since your key would be the link/URL; however, if you're performing a URL-seen test, then you can hash each link/URL into an integer and store the hash as the key and the link/URL as the value.
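For example, a URL-seen store keyed by a hash could look like this (string.GetHashCode is used only to keep the sketch short; a real URL-seen test would want a wider hash, e.g. 64-bit, to keep collisions rare):

    // Illustration only: URL-seen test keyed by a hash of the URL.
    using System.Collections.Concurrent;

    class UrlSeenSketch
    {
        readonly ConcurrentDictionary<int, string> seen =
            new ConcurrentDictionary<int, string>();

        // Returns true the first time a URL (hash) is recorded,
        // false if it, or a colliding URL, was already present.
        public bool MarkSeen(string url)
        {
            return seen.TryAdd(url.GetHashCode(), url);
        }
    }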

Or is this not necessary?

It's entirely up to you to decide what's necessary, but if you're performing a URL-seen test, then it is necessary.

Can I read from my centralized Dictionary simultaneously, whenever a parse method finishes, to see if the links are already there?

Yes, ConcurrentDictionary allows multiple threads to read simultaneously, so that will work fine if you just want to see whether a link has already been crawled.
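One detail worth noting: if a thread checks the dictionary and then adds to it in two separate steps, two threads could both conclude that the same link is new. TryAdd does the check and the insert as one atomic operation, so a hypothetical helper (names are illustrative) could look like this:

    // Hypothetical helper: answers "is this link new?" atomically,
    // so two parse threads can never both claim the same link.
    using System.Collections.Concurrent;

    static class SeenLinks
    {
        static readonly ConcurrentDictionary<string, bool> seen =
            new ConcurrentDictionary<string, bool>();

        // Atomic check-and-record: true only for the first caller per link.
        public static bool IsNew(string link)
        {
            return seen.TryAdd(link, true);
        }

        // Pure read; safe to call from many threads at once.
        public static bool AlreadySeen(string link)
        {
            return seen.ContainsKey(link);
        }
    }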

If I run this program for 4,000 pages, it takes about 90 minutes; it would be great if I could use all 8 of my cores to finish the same task in roughly 10 minutes.

If you design your crawler sufficiently well, you should be able to download and parse (i.e., extract all the links from) 4,000 pages in about 57 seconds on an average desktop PC... I get roughly those results with the standard C# WebRequest on a 4 GB, i5 3.2 GHz PC with a 10 Mbps connection.

Answered by Kiril