I want to crawl for specific things, namely events that are taking place: concerts, movies, art gallery openings, etc. Anything that one might spend time going to.
How do I implement a crawler?
I have heard of Grub (grub.org -> Wikia) and Heritrix (http://crawler.archive.org/).
Are there others?
What opinions does everyone have?
-Jason
An excellent introductory text for that topic is Introduction to Information Retrieval (full text available online). It has a chapter on Web crawling, but perhaps more importantly, it provides a basis for the things you want to do with the crawled documents.
(source: stanford.edu)
There's a good book on the subject I can recommend called Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL.
Whatever you do, please be a good citizen and obey the robots.txt file. You might want to check the references on the Wikipedia page on focused crawlers. I just realized that I know one of the authors of Topical Web Crawlers: Evaluating Adaptive Algorithms. Small world.
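To make the robots.txt advice and the focused-crawling idea a bit more concrete, here is a minimal sketch in Python. It uses the standard library's `urllib.robotparser` to check whether a URL may be fetched, and a very naive keyword count on link text to decide which allowed links to visit first. The user agent name `EventCrawler`, the `EVENT_TERMS` keyword set, and the helper names are all made up for illustration. The papers on focused crawlers describe far better relevance measures than this.

```python
import heapq
from urllib.robotparser import RobotFileParser

USER_AGENT = "EventCrawler"  # hypothetical crawler name
# Assumed keyword list; a real focused crawler would use a trained classifier.
EVENT_TERMS = {"concert", "movie", "gallery", "opening", "exhibition"}


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def relevance(anchor_text: str) -> int:
    """Count how many words of the link text look event-related."""
    words = anchor_text.lower().split()
    return sum(1 for w in words if w.strip(".,!?") in EVENT_TERMS)


def prioritize(links, parser):
    """Return the URLs robots.txt allows, most event-like first.

    links is an iterable of (url, anchor_text) pairs.
    """
    heap = []
    for order, (url, text) in enumerate(links):
        if parser.can_fetch(USER_AGENT, url):
            # Negative score so the highest relevance pops first;
            # order breaks ties in favour of links seen earlier.
            heapq.heappush(heap, (-relevance(text), order, url))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


if __name__ == "__main__":
    sample_robots = "User-agent: *\nDisallow: /private/\n"
    sample_links = [
        ("https://example.com/about", "About us"),
        ("https://example.com/events/1", "Jazz concert and gallery opening"),
        ("https://example.com/private/x", "Secret concert"),
    ]
    for url in prioritize(sample_links, build_parser(sample_robots)):
        print(url)
```

In a real crawler you would fetch each site's robots.txt once (for example with `RobotFileParser.set_url` and `read`), cache the parser per host, and add a delay between requests to the same server.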