 

Does any open, easily extensible web crawler exist?

I am looking for a web crawler solution that is mature enough and can be easily extended. I am interested in the following features, or in the possibility of extending the crawler to support them:

  • in part, simply reading the feeds of several sites
  • scraping the content of these sites
  • if a site has an archive, crawling and indexing it as well
  • exploring part of the Web on my behalf and deciding which sites match the given criteria
  • notifying me when something possibly matching my interests is found
  • not overwhelming servers with too many requests; it should crawl politely
  • being robust against malformed sites and misbehaving servers
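
As an illustration of the last two bullets, here is a minimal sketch of per-host politeness: consulting robots.txt and throttling the request rate. This is my own toy example, not code from any particular crawler project; the `PoliteFetcher` class and its defaults are invented for illustration.

```python
# Toy sketch of "polite" crawling: respect robots.txt and keep a
# minimum delay between requests to the same host. Hypothetical
# class, not from any real crawler library.
import time
import urllib.robotparser
from urllib.parse import urlparse


class PoliteFetcher:
    def __init__(self, delay_seconds=1.0, user_agent="MyCrawler"):
        self.delay = delay_seconds
        self.user_agent = user_agent
        self.last_hit = {}   # host -> monotonic timestamp of last request
        self.robots = {}     # host -> cached RobotFileParser

    def allowed(self, url, robots_txt=None):
        """Check robots.txt rules for this URL.

        robots_txt may be supplied directly (e.g. for testing);
        otherwise the parser fetches /robots.txt over the network.
        """
        host = urlparse(url).netloc
        if host not in self.robots:
            rp = urllib.robotparser.RobotFileParser()
            if robots_txt is not None:
                rp.parse(robots_txt.splitlines())
            else:
                rp.set_url(f"https://{host}/robots.txt")
                rp.read()  # network fetch in real use
            self.robots[host] = rp
        return self.robots[host].can_fetch(self.user_agent, url)

    def wait_time(self, url, now=None):
        """Seconds to sleep before the next request to this host."""
        host = urlparse(url).netloc
        now = time.monotonic() if now is None else now
        elapsed = now - self.last_hit.get(host, float("-inf"))
        return max(0.0, self.delay - elapsed)

    def record(self, url, now=None):
        """Remember that we just hit this host."""
        host = urlparse(url).netloc
        self.last_hit[host] = time.monotonic() if now is None else now
```

In a real crawler you would call `wait_time`/`record` around every fetch; most mature frameworks build this in as a configurable per-host delay.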

Each of the things above could be done individually without much effort, but I am interested in a solution that provides a customisable, extensible crawler. I have heard of Apache Nutch, but I am very unsure about the project so far. Do you have experience with it? Can you recommend alternatives?
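
To make the notion of a "customisable, extensible crawler" concrete, here is a toy sketch of a crawler core with two extension points: URL filters that decide what to visit, and page handlers that process fetched content. This is my own illustration of the general design, not the API of Nutch or any other named project.

```python
# Hypothetical mini-crawler illustrating plug-in style extension
# points: `filters` decide which URLs to visit, `handlers` process
# each fetched page. Not a real library.
from collections import deque


class MiniCrawler:
    def __init__(self, fetch):
        self.fetch = fetch      # fetch(url) -> (content, outlinks)
        self.filters = []       # callables url -> bool: visit this URL?
        self.handlers = []      # callables (url, content) -> None

    def should_visit(self, url):
        return all(f(url) for f in self.filters)

    def crawl(self, seed, limit=100):
        seen, queue = set(), deque([seed])
        while queue and len(seen) < limit:
            url = queue.popleft()
            if url in seen or not self.should_visit(url):
                continue
            seen.add(url)
            content, outlinks = self.fetch(url)
            for handler in self.handlers:
                handler(url, content)
            queue.extend(outlinks)
        return seen
```

Extending such a crawler means appending a filter (e.g. "only sites matching my criteria") or a handler (e.g. "notify me on a match") rather than modifying the core loop; that is roughly the kind of extensibility the question is after.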

fifigyuri asked Jan 18 '10

2 Answers

I've used Nutch extensively when I was building the open source project index for my Krugle startup. It's hard to customize, being a fairly monolithic design. There is a plug-in architecture, but the interaction between plug-ins and the system is tricky and fragile.

As a result of that experience, and needing something with more flexibility, I started the Bixo project - a web mining toolkit. http://openbixo.org.

Whether it's right for you depends on the weighting of factors such as:

  1. How much flexibility you need (+)
  2. How mature it should be (-)
  3. Whether you need the ability to scale (+)
  4. If you're comfortable with Java/Hadoop (+)
kkrugler answered Sep 18 '22


A quick search on GitHub turned up Anemone, a web spider framework that seems to fit your requirements, particularly extensibility. It is written in Ruby.
Hope it goes well!

Joseph Salisbury answered Sep 20 '22