Recrawl URL with Nutch just for updated sites

Question

I crawled a URL with Nutch 2.1, and now I want to re-crawl its pages only after they have been updated. How can I do this? How can I tell that a page has been updated?

İsmet Alkan · Accepted Answer

Simply put, you can't know in advance. You have to recrawl a page to check whether it has been updated. So, according to your needs, prioritize the pages/domains and recrawl them at regular intervals. For that you need a job scheduler such as Quartz.
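As a rough illustration of prioritized, periodic recrawling, here is a minimal sketch in plain Java. It is not Quartz and not part of Nutch: the class `RecrawlPlanner`, its methods, and the example intervals are all hypothetical. It only decides which URLs are due at a given moment; a scheduler such as Quartz would call `dueAt` periodically and hand the result to the crawler.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RecrawlPlanner {

    /** Recrawl interval and last fetch time for one URL. */
    static final class Entry {
        final Duration interval;
        Instant lastFetch;

        Entry(Duration interval, Instant lastFetch) {
            this.interval = interval;
            this.lastFetch = lastFetch;
        }
    }

    // LinkedHashMap keeps insertion order, so results are deterministic
    private final Map<String, Entry> entries = new LinkedHashMap<>();

    /** Registers a URL; a shorter interval means a higher priority. */
    void add(String url, Duration interval, Instant lastFetch) {
        entries.put(url, new Entry(interval, lastFetch));
    }

    /** Returns every URL whose interval has elapsed at the given time. */
    List<String> dueAt(Instant now) {
        List<String> due = new ArrayList<>();
        for (Map.Entry<String, Entry> e : entries.entrySet()) {
            Instant nextFetch = e.getValue().lastFetch.plus(e.getValue().interval);
            if (!now.isBefore(nextFetch)) {
                due.add(e.getKey());
            }
        }
        return due;
    }

    /** Records that a URL was fetched, which restarts its interval. */
    void markFetched(String url, Instant when) {
        Entry entry = entries.get(url);
        if (entry == null) {
            throw new IllegalArgumentException("Unknown URL: " + url);
        }
        entry.lastFetch = when;
    }

    public static void main(String[] args) {
        RecrawlPlanner planner = new RecrawlPlanner();
        Instant start = Instant.parse("2013-01-01T00:00:00Z");
        // Hypothetical URLs and intervals, chosen only for the example
        planner.add("http://news.example.com/", Duration.ofMinutes(15), start);
        planner.add("http://docs.example.com/", Duration.ofDays(7), start);
        System.out.println(planner.dueAt(start.plus(Duration.ofHours(1))));
    }
}
```

Keeping the "what is due" decision separate from the timer makes it easy to swap in Quartz, cron, or any other scheduler later.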

You also need to write a function that compares the old and new versions of a page. However, Nutch by default saves the pages as index files. In other words, Nutch generates new binary files to store the HTML, and it combines all crawl results within a single file, so I don't think it's practical to compare those binary files directly. If you want to save pages in raw HTML format to compare them, see my answer to this question.
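Assuming you do have the raw HTML available, one simple way to compare versions is to store a digest of each page and compare it against the digest of the fresh copy. The sketch below uses only the standard `java.security.MessageDigest` API; the class and method names (`PageChangeDetector`, `sha256Hex`, `hasChanged`) are hypothetical and not part of Nutch. Note that this flags any byte-level difference, so pages with rotating ads or timestamps will always look "updated" unless you strip such parts first.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class PageChangeDetector {

    /** Returns the SHA-256 digest of the given text as lowercase hex. */
    static String sha256Hex(String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation
            throw new IllegalStateException(e);
        }
    }

    /** True when the fresh HTML no longer matches the stored digest. */
    static boolean hasChanged(String storedDigest, String freshHtml) {
        return !sha256Hex(freshHtml).equals(storedDigest);
    }

    public static void main(String[] args) {
        String oldHtml = "<html><body>Version 1</body></html>";
        String newHtml = "<html><body>Version 2</body></html>";
        String stored = sha256Hex(oldHtml);
        System.out.println("Unchanged copy flagged: " + hasChanged(stored, oldHtml));
        System.out.println("Edited copy flagged: " + hasChanged(stored, newHtml));
    }
}
```

Storing only the digest per URL keeps the comparison cheap, since the previous HTML does not have to be kept around.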

Jayendra · Answer

You have to schedule a job to trigger the recrawl.
However, Nutch's AdaptiveFetchSchedule should enable you to crawl and index pages and detect whether a page is new or updated, so you don't have to do it manually.

The article describes this in detail.
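To give a feel for what an adaptive schedule does, here is a simplified sketch of the general idea: shorten the refetch interval when a page turned out to be modified, lengthen it when it was unchanged, and keep the result within fixed bounds. This is not Nutch's actual implementation; the class name and all constants below are assumptions chosen for illustration. In Nutch itself the schedule class is normally selected through configuration (the `db.fetch.schedule.class` property), so check the `nutch-default.xml` of your version for the exact property names and defaults.

```java
public class AdaptiveIntervalSketch {

    // Illustrative values only, not read from any Nutch configuration
    static final double INC_RATE = 0.4;   // growth factor for unchanged pages
    static final double DEC_RATE = 0.2;   // shrink factor for modified pages
    static final long MIN_INTERVAL_SEC = 60L;
    static final long MAX_INTERVAL_SEC = 365L * 24 * 3600;

    /** Computes the next refetch interval from the current one. */
    static long nextInterval(long currentIntervalSec, boolean modified) {
        double next = modified
                ? currentIntervalSec * (1.0 - DEC_RATE)   // changed: come back sooner
                : currentIntervalSec * (1.0 + INC_RATE);  // unchanged: wait longer
        long rounded = Math.round(next);
        // Clamp so the interval never leaves the allowed range
        return Math.max(MIN_INTERVAL_SEC, Math.min(MAX_INTERVAL_SEC, rounded));
    }

    public static void main(String[] args) {
        long interval = 3600; // start with one hour
        boolean[] observations = {true, true, false, true};
        for (boolean modified : observations) {
            interval = nextInterval(interval, modified);
            System.out.println("modified=" + modified + " -> next interval " + interval + " s");
        }
    }
}
```

With a rule like this, a frequently changing front page drifts toward the minimum interval, while static pages are visited less and less often.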

user1973842 · Answer

What about http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/ ?

This is also discussed in: How to recrawle nutch

I am wondering whether the above-mentioned solution will indeed work; I am trying it as we speak. I crawl news sites, and they update their front pages quite frequently, so I need to re-crawl the index/front page often and fetch the newly discovered links.

Tags: apache, solr, lucene, web-crawler, nutch

Ilce MKD
