Crawling The Internet

Question

I want to crawl the web for specific things: events that are taking place, such as concerts, movies, art gallery openings, etc. Basically anything that one might spend time going to.

How do I implement a crawler?

I have heard of Grub (grub.org -> Wikia) and Heritrix (http://crawler.archive.org/).

Are there others?

What opinions does everyone have?

-Jason

Fabian Steeg · Accepted Answer

An excellent introductory text on that topic is Introduction to Information Retrieval (full text available online). It has a chapter on Web crawling, but perhaps more importantly, it provides a basis for the things you want to do with the crawled documents.

Introduction to Information Retrieval (source: stanford.edu)

Bill the Lizard · Answer

There's a good book on the subject I can recommend called Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL.

tvanfosson · Answer

Whatever you do, please be a good citizen and obey the robots.txt file. You might want to check the references on the Wikipedia page on focused crawlers. I just realized that I know one of the authors of Topical Web Crawlers: Evaluating Adaptive Algorithms. Small world.
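To make the robots.txt advice concrete, here is a minimal sketch using Python's standard-library urllib.robotparser. The robots.txt content, the crawler name "EventCrawler", and the example.com URLs are all made up for illustration; the rules are inlined so the snippet runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
# A real crawler would instead call parser.set_url(".../robots.txt")
# followed by parser.read() to download the site's actual file.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

USER_AGENT = "EventCrawler"  # hypothetical crawler name


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str) -> bool:
    """Return True if the rules permit USER_AGENT to fetch this URL."""
    return parser.can_fetch(USER_AGENT, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)
events_ok = is_allowed(parser, "https://example.com/events/concerts")
private_ok = is_allowed(parser, "https://example.com/private/admin")
delay = parser.crawl_delay(USER_AGENT)  # seconds to wait between requests

print(events_ok, private_ok, delay)
```

The idea is to run a check like is_allowed before every fetch and to sleep for the crawl delay (when the site specifies one) between requests to the same host.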

Tags: nlp, text-mining, web-crawler, information-retrieval

Toddly
