Developing a crawler and scraper for a vertical search engine

Question

I need to develop a vertical search engine as part of a website. The data for the search engine comes from websites in a specific category. I think I need a crawler that crawls a few hundred sites in a specific business category and extracts the content and URLs of products and services. Other types of pages may be irrelevant. Most of the sites are tiny or small (a few hundred pages at most). The products have 10 to 30 attributes.

Any ideas on how to write such a crawler and extractor? I have written a few crawlers and content extractors using the usual Ruby libraries, but not a full-fledged search engine. I imagine the crawler wakes up from time to time and downloads pages from the websites, following the usual polite behavior such as checking robots exclusion rules, while the content extractor updates the database after it reads the pages. How do I synchronize the crawler and the extractor? How tightly should they be integrated?

Mauricio Scheffer · Accepted Answer

Apache Nutch builds on Lucene and already implements a crawler and several document parsers. You can also run it on Hadoop for scalability.
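On the synchronization question, one loosely coupled arrangement is to have the crawler and the extractor communicate only through a shared page store, so that each can run on its own schedule. The sketch below is a minimal illustration of that idea in Python, not how Nutch does it. Everything in it is an assumption made for the example: the `pages` and `products` tables, the `vertical-bot` user agent, and the `fetch` and `parse_product` callables, which stand in for a real HTTP client and a real attribute extractor.

```python
import sqlite3
from urllib.robotparser import RobotFileParser

# Hypothetical schema: raw pages plus a flag telling the extractor what is pending.
SCHEMA = """
CREATE TABLE IF NOT EXISTS pages (
    url       TEXT PRIMARY KEY,
    html      TEXT NOT NULL,
    extracted INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE IF NOT EXISTS products (
    url   TEXT PRIMARY KEY,
    title TEXT
);
"""


def open_store(path=":memory:"):
    """Open (or create) the shared store used by both components."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def allowed(robots_txt, url, agent="vertical-bot"):
    """Check a URL against the text of a site's robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def crawl(conn, urls, fetch, robots_txt=""):
    """Crawler side: store raw HTML for allowed URLs; returns pages stored."""
    stored = 0
    for url in urls:
        if not allowed(robots_txt, url):
            continue
        html = fetch(url)  # assumed callable: url -> HTML string
        # Re-crawling a URL resets the flag so the extractor sees it again.
        conn.execute(
            "INSERT OR REPLACE INTO pages (url, html, extracted) VALUES (?, ?, 0)",
            (url, html),
        )
        stored += 1
    conn.commit()
    return stored


def extract_pending(conn, parse_product):
    """Extractor side: process pages not yet extracted; returns pages handled."""
    rows = conn.execute(
        "SELECT url, html FROM pages WHERE extracted = 0"
    ).fetchall()
    for url, html in rows:
        product = parse_product(html)  # assumed: dict for product pages, else None
        if product is not None:
            conn.execute(
                "INSERT OR REPLACE INTO products (url, title) VALUES (?, ?)",
                (url, product["title"]),
            )
        conn.execute("UPDATE pages SET extracted = 1 WHERE url = ?", (url,))
    conn.commit()
    return len(rows)
```

With this split, the only contract between the two components is the `extracted` flag: the crawler never needs to know how attributes are parsed, and the extractor can be rerun over stored pages when the parsing rules change, without downloading anything again.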

Tags:

search

search-engine

screen-scraping

web-crawler

Ven
