I am searching for a web crawler solution that is mature enough and can be easily extended. I am interested in the following features... or in the possibility of extending the crawler to meet them:
The things above can be done one by one without much effort, but I am interested in a solution that provides a customisable, extensible crawler. I have heard of Apache Nutch, but I am very unsure about the project so far. Do you have experience with it? Can you recommend alternatives?
Web crawler characteristics:
- High HTTP request rate, typically issued in parallel.
- A large number of URL visits, in terms of both the total number of URLs and the number of directories touched.
- More requests for specific file types than for others; for example, more requests for .html …
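To make the first point concrete, here is a minimal Java sketch of a crawler's fetch phase issuing a batch of HTTP requests in parallel. The URLs and the response handling are placeholders, not part of any particular crawler's API.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

// Sketch: fetch a batch of URLs concurrently, as a crawler's fetch phase would.
public class ParallelFetcher {
    public static void main(String[] args) {
        HttpClient client = HttpClient.newHttpClient();
        List<String> urls = List.of(
                "https://example.com/",
                "https://example.com/index.html");   // placeholder batch

        List<CompletableFuture<Void>> futures = urls.stream()
                .map(url -> HttpRequest.newBuilder(URI.create(url)).GET().build())
                .map(req -> client.sendAsync(req, HttpResponse.BodyHandlers.ofString())
                        .thenAccept(resp -> System.out.println(
                                resp.uri() + " -> " + resp.statusCode())))
                .collect(Collectors.toList());

        // Wait for the whole batch before moving on to the next set of URLs.
        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
    }
}
```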
The hidden Web refers to the collection of Web data that a crawler can access only by interacting with a Web-based search form, not simply by traversing hyperlinks.
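As a rough illustration, reaching such content means the crawler submits the form itself instead of following a link. The sketch below assumes a hypothetical search endpoint and a form field named "q"; both are placeholders for whatever the real site's form uses.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: submit a search form with a query term to reach pages no static link points to.
public class FormSubmitter {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String formBody = "q=web+crawler";  // URL-encoded form data (hypothetical field name)

        HttpRequest request = HttpRequest.newBuilder(URI.create("https://example.com/search"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody))
                .build();

        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        // The result page can now be parsed for records hidden behind the form.
        System.out.println(response.statusCode());
    }
}
```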
A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing (web spidering).
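The "systematically browses" part boils down to a frontier queue of URLs to visit plus a visited set to avoid revisiting pages. Here is a simplified Java sketch of that loop; the seed URL, the page limit, and the naive regex-based link extraction are illustrative shortcuts, not production-quality parsing.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: breadth-first crawl driven by a frontier queue and a visited set.
public class SimpleCrawler {
    private static final Pattern LINK = Pattern.compile("href=\"(https?://[^\"]+)\"");

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        Queue<String> frontier = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        frontier.add("https://example.com/");  // seed URL (placeholder)

        while (!frontier.isEmpty() && visited.size() < 20) {  // small limit for the sketch
            String url = frontier.poll();
            if (!visited.add(url)) continue;  // already seen

            HttpResponse<String> resp = client.send(
                    HttpRequest.newBuilder(URI.create(url)).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
            System.out.println(url + " -> " + resp.statusCode());

            // Discover new URLs in the page and push them onto the frontier.
            Matcher m = LINK.matcher(resp.body());
            while (m.find()) {
                String link = m.group(1);
                if (!visited.contains(link)) frontier.add(link);
            }
        }
    }
}
```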
I've used Nutch extensively when I was building the open source project index for my Krugle startup. It's hard to customize because of its fairly monolithic design. There is a plug-in architecture, but the interaction between plug-ins and the system is tricky and fragile.
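To make the plug-in idea concrete, here is a generic Java sketch of the kind of extension point a pluggable crawler exposes. It is only loosely inspired by Nutch's URL-filter plug-ins; the interface and class names are hypothetical and this is not Nutch's actual API.

```java
import java.util.List;

// Sketch of a plug-in chain: each plug-in may rewrite a URL or veto it before fetching.
public class PluginExample {

    // Hypothetical extension point: return the (possibly rewritten) URL, or null to drop it.
    interface UrlFilter {
        String filter(String url);
    }

    // Example plug-in: keep only URLs from a single host.
    static class SameHostFilter implements UrlFilter {
        public String filter(String url) {
            return url.startsWith("https://example.com/") ? url : null;
        }
    }

    // The crawler core runs every candidate URL through the plug-in chain.
    static String applyFilters(String url, List<UrlFilter> filters) {
        for (UrlFilter f : filters) {
            if (url == null) return null;
            url = f.filter(url);
        }
        return url;
    }

    public static void main(String[] args) {
        List<UrlFilter> chain = List.of(new SameHostFilter());
        System.out.println(applyFilters("https://example.com/page", chain)); // kept
        System.out.println(applyFilters("https://other.org/page", chain));   // null (dropped)
    }
}
```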
As a result of that experience, and needing something with more flexibility, I started the Bixo project - a web mining toolkit. http://openbixo.org.
Whether it's right for you depends on the weighting of factors such as:
A quick search on GitHub turned up Anemone, a web spider framework that seems to fit your requirements, particularly extensibility. It's written in Ruby.
Hope it goes well!