Which Open Source Crawler is best?

Question

I am comparing these four: Nutch, Heritrix, OpenPipeLine, and Apache Tika. Which one is best? What are the merits and demerits of each? I would like an extensible crawler that can crawl a list of websites and can be modified if needed.

İsmet Alkan · Accepted Answer

Nutch is the best all-rounder of them and is extremely configurable. I have tried it with 100 million documents, and it is trustworthy.

Heritrix works fine too, but not better than Nutch.

You can give Crawler4j a try if you need to crawl fast.

For an introductory crawl with a crawler that is easy to use and configure through a simple user interface, you can try WebSPHINX.

Tika isn't a crawler: it's a toolkit that detects and extracts metadata and structured text content from documents.
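To make the crawler versus extraction-toolkit distinction concrete, here is a minimal sketch in plain Java. It is not how Nutch, Heritrix, or Tika are implemented, and it uses none of their APIs. The class `TinyCrawler`, its methods, the regex-based link finder, and the in-memory "fake web" are all assumptions made up for illustration. The `crawl` method does what a crawler does (manages a frontier of URLs and follows links), while `extractText` stands in for the kind of work a toolkit like Tika does on content that has already been fetched.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TinyCrawler {
    // Hypothetical, simplistic link pattern; a real crawler would use an HTML parser.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    /** Extraction step (the role a toolkit like Tika plays): raw markup in, plain text out. */
    public static String extractText(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    /**
     * Crawling step: breadth-first walk over links, starting from the seed list.
     * The fetcher is pluggable so the example runs without network access.
     */
    public static List<String> crawl(List<String> seeds,
                                     Function<String, String> fetcher,
                                     int maxPages) {
        Deque<String> frontier = new ArrayDeque<>(seeds);
        Set<String> visited = new LinkedHashSet<>();
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue; // already seen this URL
            }
            String html = fetcher.apply(url);
            if (html == null) {
                continue; // nothing fetched, so no links to follow
            }
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                String link = m.group(1);
                if (!visited.contains(link)) {
                    frontier.add(link);
                }
            }
        }
        return new ArrayList<>(visited);
    }

    public static void main(String[] args) {
        // In-memory stand-in for real websites (illustrative data only).
        Map<String, String> fakeWeb = Map.of(
            "http://a.example/", "<a href=\"http://a.example/b\">B</a> <p>Home</p>",
            "http://a.example/b", "<p>Page <b>B</b></p>");
        for (String url : crawl(List.of("http://a.example/"), fakeWeb::get, 10)) {
            System.out.println(url + " -> " + extractText(fakeWeb.get(url)));
        }
    }
}
```

In a real setup the two roles are usually combined: the crawler fetches the pages and hands each one to an extraction library for parsing.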

I had a job that required crawling, but OpenPipeLine wasn't on my list of favourite crawlers. It has a UI and a job scheduler, and it's used for enterprise solutions. Since you just want to crawl some websites, you won't need such things.

Tags:

web-crawler

nutch

Riz
