 

Developing a crawler and scraper for a vertical search engine

I need to develop a vertical search engine as part of a website. The data for the search engine comes from websites in a specific category. For this I guess I need a crawler that crawls several (a few hundred) sites in a specific business category and extracts the content and URLs of products and services. Other types of pages may be irrelevant. Most of the sites are tiny or small (a few hundred pages at most). The products have 10 to 30 attributes.

Any ideas on how to write such a crawler and extractor? I have written a few crawlers and content extractors using the usual Ruby libraries, but not a full-fledged search engine. I guess the crawler wakes up from time to time and downloads the pages from the websites. The usual polite behavior, like checking robots exclusion rules, will be followed, of course. Meanwhile the content extractor can update the database as it reads the pages. How do I synchronize the crawler and the extractor? How tightly should they be integrated?

Ven asked Jul 05 '09 17:07



1 Answer

Nutch builds on Lucene and already implements a crawler and several document parsers. You can also hook it to Hadoop for scalability.
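For a rough idea of what driving Nutch looks like, the 1.x releases from that era shipped a one-shot `crawl` command (later deprecated in favor of separate inject/generate/fetch steps). The paths and flag values below are illustrative, not from the question; check the tutorial for your Nutch version.

```shell
# Seed list: one start URL per line (directory name is arbitrary).
mkdir -p urls
echo "http://example.com/" > urls/seed.txt

# One-shot crawl with Nutch 1.x:
# -depth limits how many link-following rounds to run,
# -topN caps how many pages are fetched per round.
bin/nutch crawl urls -dir crawl -depth 3 -topN 1000
```

Since the sites here are small, a shallow depth and modest topN per site would likely suffice; the per-site parsing of the 10-30 product attributes would still need custom parser plugins on top of what Nutch provides.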

Mauricio Scheffer answered Oct 02 '22 03:10
