I am comparing these four: Nutch, Heritrix, OpenPipeLine, and Apache Tika. Which one is best? What are the merits and demerits of each? I would like an extensible crawler that can crawl a list of websites and can be modified if needed.
Nutch is the most all-round of them and is extremely configurable. It has been tried with 100 million documents and is trustworthy.
Heritrix works fine too, but not better than Nutch.
You can give Crawler4j a try if you need to crawl fast.
For an introductory crawl with a crawler that is easy to use and configure through a simple user interface, you can try WebSPHINX.
Tika isn't a crawler: it's a toolkit that detects and extracts metadata and structured text content.
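To make that distinction concrete, here is a minimal sketch of the two jobs side by side: the crawling step (fetch a page, follow its links) and the extraction step (pull text out of what was fetched). It is written in plain Python with the standard library only and does not use the Nutch, Heritrix, or Tika APIs; the names `PageParser`, `fetch`, `crawl`, `fetch_page`, and `max_pages` are hypothetical, chosen just for this illustration. It also assumes pages are HTML and ignores robots.txt, politeness delays, and non-HTML formats, which a real crawler would have to handle.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse
from urllib.request import urlopen


class PageParser(HTMLParser):
    """Collects href links and text nodes from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())


def fetch(url):
    """Default fetcher (hypothetical helper): download a page as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seeds, fetch_page=fetch, max_pages=50):
    """Breadth-first crawl restricted to the hosts of the seed URLs."""
    allowed_hosts = {urlparse(seed).netloc for seed in seeds}
    queue = deque(seeds)
    seen = set(seeds)
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to download
        parser = PageParser()
        parser.feed(html)
        # Extraction step: the part a toolkit like Tika would take over.
        pages[url] = " ".join(parser.text_parts)
        # Crawling step: resolve links and queue the ones on allowed hosts.
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc in allowed_hosts and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Calling `crawl(["https://example.com/"])` would return a dictionary mapping each visited URL to its extracted text. Passing the fetcher in as a parameter is a deliberate choice here: it keeps the loop easy to modify or test without touching the network.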
I had a job that required crawling, but OpenPipeLine wasn't on the list of favourite crawlers. It has a UI and a job scheduler, and it's used for enterprise solutions. As you just want to crawl some websites, you won't need such things.