Crawling The Internet

I want to crawl the web for specific things: events such as concerts, movies, and art gallery openings. Basically, anything that one might spend time going to.

How do I implement a crawler?

I have heard of Grub (grub.org -> Wikia) and Heritrix (http://crawler.archive.org/).

Are there others?

What opinions does everyone have?

-Jason

Toddly asked Apr 07 '09 23:04


3 Answers

An excellent introductory text for that topic is Introduction to Information Retrieval (full text available online). It has a chapter on Web crawling, but perhaps more importantly, it provides a basis for the things you want to do with the crawled documents.


Fabian Steeg answered Nov 19 '22 22:11


There's a good book on the subject I can recommend called Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL.

Bill the Lizard answered Nov 19 '22 23:11


Whatever you do, please be a good citizen and obey the robots.txt file. You might want to check the references on the Wikipedia page on focused crawlers. I just realized that I know one of the authors of Topical Web Crawlers: Evaluating Adaptive Algorithms. Small world.
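For what it's worth, here is a minimal sketch of the robots.txt check using Python's standard urllib.robotparser module. The user agent name "EventCrawler", the example.com URLs, and the sample robots.txt rules are all made up for illustration; a real crawler would download each site's own robots.txt (for instance with set_url() and read()) rather than parse a hard-coded string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, hard-coded so the sketch runs offline.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, user_agent, urls):
    """Keep only the URLs that the rules let this user agent fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


parser = build_parser(SAMPLE_ROBOTS)

# Hypothetical candidate URLs a crawler might have queued up.
candidates = [
    "http://example.com/events/concerts",
    "http://example.com/private/admin",
]

to_fetch = allowed_urls(parser, "EventCrawler", candidates)
delay = parser.crawl_delay("EventCrawler")  # seconds to wait between requests, or None

print(to_fetch)
print(delay)
```

Filtering the queue before fetching, and sleeping for the crawl delay between requests to the same host, covers the basics of being polite. It does not cover everything a site may ask for, so treat it as a starting point only.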

tvanfosson answered Nov 19 '22 22:11