I'm re-writing the spidering/crawler portion of a Delphi 6 site mapper application that I previously wrote. The app spiders a single site.
I need to manage two aspects of this:
Previously these were done with a TList and a StringList respectively. Obviously the performance of these degraded on sites with thousands of links.
My question is, what should be used for these queues/list to ensure the best performance? I have little experience with hashes.
IMHO hashes will be the best candidate for such lists.
In Delphi 6, you can use the THashedStringList
class available in the IniFiles
unit. It will be faster than a TStringList
.
Notice that if you use a sorted TStringList
, you could use much faster binary search, fast enough for your purpose.
For something more complete, you may take a look at those OpenSource libraries:
Update:
Just a trick for sorting URIs, if you use a sorted TStringList
: you may better use a backward sort function, i.e. compare the URI text beginning from the end of the string instead of the beginning, since in URI, the change is in the suffix rather than in the prefix. You'll make the sort / binary search faster.
Trie's work great for storing large (unique) text blocks and retaining high speed searching. A while back I did a quick and dirty article for Pascal Gamer about them: http://www.pascalgamer.com/issue_details.php?i=1 that might be worth a read.
Basic concept is to create a record (class, whatever) that contains a letter or symbol and all its linked letters and symbols as children. These children are stored sorted so a quick binary search can be used to find the next node. When you reach the end of your input you can tell if your at a word end or valid position.
Nice thing about Trie's you can do partial matching, reverse lookups, skip searches, and etc without any problems. Down sides are; you can't have duplicate entries easily, they take more space on SMALLER data sets, and depending on your implementation case sensitive switching can be "interesting".
Use the concept day in and day out on millions of records with no issues and high speed retention.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With