Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Crawler url queue or hash list?

I'm re-writing the spidering/crawler portion of a Delphi 6 site mapper application that I previously wrote. The app spiders a single site.

I need to manage two aspects of this:

  1. A Queue for URLs to scan, first in, first out.
  2. A Scanned list of URLs so that links from a new page are not added to the queue if they were already visited. This list would need to be searched.

Previously these were done with a TList and a StringList respectively. Obviously the performance of these degraded on sites with thousands of links.

My question is, what should be used for these queues/list to ensure the best performance? I have little experience with hashes.

like image 315
MikeD Avatar asked Jul 28 '11 13:07

MikeD


2 Answers

IMHO hashes will be the best candidate for such lists.

In Delphi 6, you can use the THashedStringList class available in the IniFiles unit. It will be faster than a TStringList.

Notice that if you use a sorted TStringList, you could use much faster binary search, fast enough for your purpose.

For something more complete, you may take a look at those OpenSource libraries:

  • TSynBigTableMetaData to store any amount data (in your case HTML pages) associated with metadata fields - you have indexes for metadata fields, so adding and retrieval will be quick;
  • A dynamic array, with the name using a hash can be used in Delphi 6 with TDynArrayHashed.

Update:

Just a trick for sorting URIs, if you use a sorted TStringList: you may better use a backward sort function, i.e. compare the URI text beginning from the end of the string instead of the beginning, since in URI, the change is in the suffix rather than in the prefix. You'll make the sort / binary search faster.

like image 117
Arnaud Bouchez Avatar answered Oct 17 '22 18:10

Arnaud Bouchez


Trie's work great for storing large (unique) text blocks and retaining high speed searching. A while back I did a quick and dirty article for Pascal Gamer about them: http://www.pascalgamer.com/issue_details.php?i=1 that might be worth a read.

Basic concept is to create a record (class, whatever) that contains a letter or symbol and all its linked letters and symbols as children. These children are stored sorted so a quick binary search can be used to find the next node. When you reach the end of your input you can tell if your at a word end or valid position.

Nice thing about Trie's you can do partial matching, reverse lookups, skip searches, and etc without any problems. Down sides are; you can't have duplicate entries easily, they take more space on SMALLER data sets, and depending on your implementation case sensitive switching can be "interesting".

Use the concept day in and day out on millions of records with no issues and high speed retention.

like image 20
jdarling Avatar answered Oct 17 '22 17:10

jdarling