
Which Open Source Crawler is best?

I am comparing these four: Nutch, Heritrix, OpenPipeLine, and Apache Tika. Which one is best? What are the merits and demerits of each? I would like an extensible crawler that can crawl a list of websites and can be modified if needed.

Riz asked Oct 10 '22 07:10



1 Answer

Nutch is the most well-rounded of them and is extremely configurable. It has been tried with 100 million documents and is trustworthy.

Heritrix works fine too, but it is no better than Nutch.

You can give Crawler4j a try if you need to crawl fast.

For an introductory crawl with a crawler that is easy to use and configure through a simple user interface, you can try WebSPHINX.

Tika isn't a crawler: it's a toolkit that detects and extracts metadata and structured text content.

I had a job that required crawling, and OpenPipeLine wasn't on the list of favourite crawlers. It has a UI and a job scheduler, and it is used for enterprise solutions. Since you just want to crawl some websites, you won't need such things.
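To make "crawl a list of websites" concrete, here is a minimal sketch of the core loop that any of these crawlers implements in some form: a queue of URLs to visit, a set of URLs already seen, and a filter that keeps the crawl on the seed hosts. It is plain Python using only the standard library, not the API of Nutch, Heritrix, or Crawler4j. The names `crawl`, `LinkParser`, `fetch`, and `SITE` are made up for illustration, and the "site" is an in-memory dict, so nothing touches the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href values of <a> tags (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl restricted to the hosts of the seed URLs.

    `fetch` is a hypothetical callable: URL -> HTML string, or None on failure.
    Returns a dict mapping each visited URL to its HTML.
    """
    allowed_hosts = {urlparse(seed).netloc for seed in seeds}
    queue = deque(seeds)   # URLs waiting to be fetched
    seen = set(seeds)      # URLs already queued, to avoid revisiting
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop the #fragment part.
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc in allowed_hosts and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# Assumed in-memory "website" standing in for real HTTP fetches.
SITE = {
    "http://example.test/": '<a href="/a">A</a> <a href="http://other.test/x">X</a>',
    "http://example.test/a": '<a href="/">home</a> <a href="/b#top">B</a>',
    "http://example.test/b": "no links here",
}

pages = crawl(["http://example.test/"], SITE.get)
```

Swapping `SITE.get` for a real HTTP fetcher is the "can be modified" part. The text extraction that Tika does would be a separate step applied to each entry in `pages`. A real crawler also adds politeness delays, robots.txt handling, and retries, which this sketch leaves out.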

İsmet Alkan answered Oct 13 '22 10:10
