I'm trying to build a specialised search-engine website that indexes a limited number of websites. The solution I came up with is:
The problem is that I find Nutch quite complex, and it's a big piece of software to customise, especially since detailed documentation (books, recent tutorials, etc.) simply does not exist.
Questions now:
Thanks
Techopedia explains Apache Nutch: along with tools like Apache Hadoop and features for file storage, analysis and more, the role of Nutch is to collect and store data from the web using web-crawling algorithms. Users can run simple Nutch commands to collect information from a set of URLs.
Nutch is an open-source crawler that provides a Java library for crawling, indexing and database storage. Solr is an open-source search platform that provides full-text search and integrates with Nutch. The following are the steps for setting up Nutch and Solr for crawling and searching.
Scrapy is a Python framework for crawling websites. It is fairly small (compared to Nutch) and designed for limited site crawls. It follows a Django-like MVC style that I found pretty easy to customize.
For the crawling part, I really like anemone and crawler4j. Both let you add custom logic for link selection and page handling. For each page you decide to keep, you can easily add a call to Solr.
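The "call to Solr" for each kept page can be a plain HTTP POST to Solr's JSON update endpoint, regardless of which crawler produced the page. A stdlib-only sketch (the core name "pages" and the field names are assumptions):

```python
import json
import urllib.request

# Assumed Solr core name ("pages"); commit=true makes the doc searchable
# immediately, which is fine for small crawls.
SOLR_UPDATE_URL = "http://localhost:8983/solr/pages/update?commit=true"


def build_solr_update(doc):
    """Build the (url, payload) pair for a Solr JSON add request."""
    payload = json.dumps([doc]).encode("utf-8")
    return SOLR_UPDATE_URL, payload


def index_page(doc):
    """POST one crawled page to Solr; call this for each page you keep."""
    url, payload = build_solr_update(doc)
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Example document for a page the crawler decided to keep: `{"id": "https://example.com/", "title": "Example", "content": "..."}`; the page URL doubles as a natural unique `id`.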