 

An alternative web crawler to Nutch [closed]

I'm trying to build a specialised search engine web site that indexes a limited number of web sites. The solution I came up with is:

  • using Nutch as the web crawler,
  • using Solr as the search engine,
  • using Wicket for the front-end and the site logic.

The problem is that I find Nutch quite complex: it's a big piece of software to customise, and detailed documentation (books, recent tutorials, etc.) simply does not exist.

Questions now:

  1. Any constructive criticism about the whole idea of the site?
  2. Is there a good yet simple alternative to Nutch (as the crawling part of the site)?

Thanks

wassimans asked Nov 24 '10 17:11



2 Answers

Scrapy is a Python library that crawls web sites. It is fairly small compared to Nutch and is designed for crawls limited to a set of sites. It has a Django-like MVC style that I found pretty easy to customize.
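
To make the "limited site crawl" idea concrete, here is a minimal sketch in plain Python. It is not Scrapy code: Scrapy spiders express the same restriction through their `allowed_domains` attribute, while the `crawl` and `in_allowed_domains` functions and the `FAKE_WEB` data below are hypothetical names invented for this illustration. A real crawler would fetch pages over HTTP instead of reading from a dictionary.

```python
from collections import deque
from urllib.parse import urlparse


def in_allowed_domains(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


def crawl(start_urls, allowed_domains, fetch_links, max_pages=100):
    """Breadth-first crawl that never leaves the allowed domains.

    fetch_links(url) is a caller-supplied function that returns the
    absolute URLs linked from that page.
    """
    seen = set()
    queue = deque(u for u in start_urls if in_allowed_domains(u, allowed_domains))
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen and in_allowed_domains(link, allowed_domains):
                queue.append(link)
    return visited


# A tiny in-memory "web" (assumption) so the sketch runs without network access.
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://other.org/"],
    "http://example.com/a": ["http://example.com/", "http://blog.example.com/post"],
    "http://blog.example.com/post": [],
    "http://other.org/": ["http://other.org/x"],
}

pages = crawl(["http://example.com/"], ["example.com"], lambda u: FAKE_WEB.get(u, []))
```

The crawl follows links within `example.com` and its subdomains and ignores `other.org` entirely, which is the behaviour you want when indexing a fixed list of sites.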

nate c answered Oct 04 '22 13:10


For the crawling part, I really like anemone (Ruby) and crawler4j (Java). They both allow you to add your own custom logic for link selection and page handling. For each page that you decide to keep, you can easily add the call to Solr.
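
The three hooks described above (link selection, page handling, and the call to Solr) can be sketched roughly as follows. This is plain Python using only the standard library, not the anemone or crawler4j API. The Solr URL, the core name `mycore`, the field names `title` and `content`, and the helper names are all assumptions made for illustration; the request is only built here, not sent.

```python
import json
import re
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request

# Hypothetical Solr location and core name; adjust to your installation.
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update?commit=true"


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def should_visit(url):
    """Custom link selection: skip common static and binary assets."""
    return not re.search(r"\.(css|js|png|jpe?g|gif|pdf|zip)$", url.lower())


def solr_request(page_url, title, text):
    """Build (but do not send) a JSON update request for one page.

    The field names are assumptions and must match your Solr schema.
    """
    docs = [{"id": page_url, "title": title, "content": text}]
    return Request(
        SOLR_UPDATE_URL,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


html = '<a href="/about">About</a> <a href="logo.png">Logo</a>'
extractor = LinkExtractor("http://example.com/")
extractor.feed(html)
to_visit = [u for u in extractor.links if should_visit(u)]
request = solr_request("http://example.com/", "Home", "Welcome")
```

Passing `request` to `urllib.request.urlopen` would send the document to Solr. Keeping the request construction separate from the network call makes the page-handling logic easy to test without a running Solr instance.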

Pascal Dimassimo answered Oct 04 '22 12:10