
Does anybody know a good extendable open source web crawler? [closed]

The crawler needs an extendable architecture that allows changing the internal process, e.g. plugging in new steps (pre-parser, parser, etc.).
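To give an idea of what I mean by extendable, here is a rough sketch of such a step pipeline (all class names here are made up for illustration, not taken from any real crawler):

    from abc import ABC, abstractmethod

    class Step(ABC):
        # One pluggable stage of the processing pipeline (pre-parser, parser, ...).
        @abstractmethod
        def process(self, page: dict) -> dict:
            ...

    class PipelineCrawler:
        # Runs every fetched page through a user-supplied chain of steps,
        # so adding behavior means writing a new Step, not patching the core.
        def __init__(self, steps):
            self.steps = steps

        def handle(self, page: dict) -> dict:
            for step in self.steps:
                page = step.process(page)
            return page

    class PreParser(Step):
        # Example custom step: normalize the raw body before the real parser runs.
        def process(self, page):
            page["body"] = page["body"].strip()
            return page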

I found the Heritrix Project (http://crawler.archive.org/).

But are there other nice projects like it?

— Zanoni, asked Jun 24 '09


3 Answers

Nutch is the best you can do when it comes to a free crawler. It is built on Lucene (scaled up for enterprise use) and is backed by Hadoop, which uses MapReduce (much like Google) for large-scale data processing. Great products! I am currently reading all about Hadoop in the new (not yet released) Hadoop in Action from Manning. If you go this route, I suggest getting onto their technical review team to get an early copy of the title!

These are all Java-based. If you are a .NET guy (like me!!) then you might be more interested in Lucene.NET, Nutch.NET, and Hadoop.NET, which are all class-by-class, API-by-API ports to C#.

— Andrew Siemer, answered Nov 19 '22


You may also want to try Scrapy (http://scrapy.org/).

It is really easy to specify and run your crawlers.
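For example, a minimal spider that records each page's title and follows outgoing links looks roughly like this (the seed URL and selectors are just placeholders):

    import scrapy

    class MinimalSpider(scrapy.Spider):
        # A tiny spider: fetch a page, record its title, follow its links.
        name = "minimal"
        start_urls = ["https://example.com"]  # placeholder seed URL

        def parse(self, response):
            # Emit one item per crawled page
            yield {"url": response.url, "title": response.css("title::text").get()}
            # Queue every outgoing link with the same parse callback
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)

Save it as minimal.py and run it with "scrapy runspider minimal.py -o items.json"; no project scaffolding is needed for a one-file spider.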

— fccoelho, answered Nov 19 '22


Abot is a good extensible web crawler. Every part of the architecture is pluggable, giving you complete control over its behavior. It's open source, free for commercial and personal use, and written in C#.

https://github.com/sjdirect/abot

— sjdirect, answered Nov 19 '22