
Does anybody know a good extendable open source web crawler? [closed]

The crawler needs an extendable architecture that allows changing the internal process, e.g. plugging in new steps (pre-parser, parser, etc.).
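To give an idea of what I mean by extendable, here is a rough sketch of such a step pipeline (all class names here are made up for illustration, not taken from any real crawler):

    from abc import ABC, abstractmethod

    class Step(ABC):
        # One pluggable stage of the processing pipeline (pre-parser, parser, ...).
        @abstractmethod
        def process(self, page: dict) -> dict:
            ...

    class PipelineCrawler:
        # Runs every fetched page through a user-supplied chain of steps,
        # so adding behavior means writing a new Step, not patching the core.
        def __init__(self, steps):
            self.steps = steps

        def handle(self, page: dict) -> dict:
            for step in self.steps:
                page = step.process(page)
            return page

    class PreParser(Step):
        # Example custom step: normalize the raw body before the real parser runs.
        def process(self, page):
            page["body"] = page["body"].strip()
            return page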

I found the Heritrix Project (http://crawler.archive.org/).

But are there other nice projects like it?

— Zanoni, asked Jun 24 '09


3 Answers

Nutch is the best you can do when it comes to a free crawler. It is built on Lucene (scaled up for enterprise use) and is backed by Hadoop, which uses MapReduce (much like Google) for large-scale data processing. Great products! I am currently reading all about Hadoop in the new (not yet released) Hadoop in Action from Manning. If you go this route, I suggest getting onto their technical review team to get an early copy of the title!

These are all Java-based. If you are a .NET guy (like me!!) then you might be more interested in Lucene.NET, Nutch.NET, and Hadoop.NET, which are all class-by-class, API-by-API ports to C#.

— Andrew Siemer, answered Nov 19 '22


You may also want to try Scrapy (http://scrapy.org/).

It is really easy to specify and run your crawlers.
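For example, a minimal spider that records each page's title and follows outgoing links looks roughly like this (the seed URL and selectors are just placeholders):

    import scrapy

    class MinimalSpider(scrapy.Spider):
        # A tiny spider: fetch a page, record its title, follow its links.
        name = "minimal"
        start_urls = ["https://example.com"]  # placeholder seed URL

        def parse(self, response):
            # Emit one item per crawled page
            yield {"url": response.url, "title": response.css("title::text").get()}
            # Queue every outgoing link with the same parse callback
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)

Save it as minimal.py and run it with "scrapy runspider minimal.py -o items.json"; no project scaffolding is needed for a one-file spider.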

— fccoelho, answered Nov 19 '22


Abot is a good extensible web crawler. Every part of the architecture is pluggable, giving you complete control over its behavior. It's open source, free for commercial and personal use, and written in C#.

https://github.com/sjdirect/abot

— sjdirect, answered Nov 19 '22