I am searching for a web crawler solution that is mature enough and can be easily extended. I am interested in the following features... or in the possibility of extending the crawler to meet them:
The things above can be done one by one without much effort, but I am interested in a solution that provides a customisable, extensible crawler. I have heard of Apache Nutch, but I am very unsure about the project so far. Do you have experience with it? Can you recommend alternatives?
Web crawler characteristics:
- High HTTP request rate, typically issued in parallel.
- A large number of URL visits, in terms of both the total number of URLs and the number of directories touched.
- More requests for specific file types than for others; for example, more requests for .html …
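To make the first point concrete, here is a minimal Java sketch of a crawler's fetch phase issuing a batch of HTTP requests in parallel. The URLs and the response handling are placeholders, not part of any particular crawler's API.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

// Sketch: fetch a batch of URLs concurrently, as a crawler's fetch phase would.
public class ParallelFetcher {
    public static void main(String[] args) {
        HttpClient client = HttpClient.newHttpClient();
        List<String> urls = List.of(
                "https://example.com/",
                "https://example.com/index.html");   // placeholder batch

        List<CompletableFuture<Void>> futures = urls.stream()
                .map(url -> HttpRequest.newBuilder(URI.create(url)).GET().build())
                .map(req -> client.sendAsync(req, HttpResponse.BodyHandlers.ofString())
                        .thenAccept(resp -> System.out.println(
                                resp.uri() + " -> " + resp.statusCode())))
                .collect(Collectors.toList());

        // Wait for the whole batch before moving on to the next set of URLs.
        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
    }
}
```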
The hidden Web refers to the collection of Web data that a crawler can access only by interacting with a Web-based search form, not simply by traversing hyperlinks.
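As a rough illustration, reaching such content means the crawler submits the form itself instead of following a link. The sketch below assumes a hypothetical search endpoint and a form field named "q"; both are placeholders for whatever the real site's form uses.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: submit a search form with a query term to reach pages no static link points to.
public class FormSubmitter {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String formBody = "q=web+crawler";  // URL-encoded form data (hypothetical field name)

        HttpRequest request = HttpRequest.newBuilder(URI.create("https://example.com/search"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody))
                .build();

        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        // The result page can now be parsed for records hidden behind the form.
        System.out.println(response.statusCode());
    }
}
```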
A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing (web spidering).
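The "systematically browses" part boils down to a frontier queue of URLs to visit plus a visited set to avoid revisiting pages. Here is a simplified Java sketch of that loop; the seed URL, the page limit, and the naive regex-based link extraction are illustrative shortcuts, not production-quality parsing.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: breadth-first crawl driven by a frontier queue and a visited set.
public class SimpleCrawler {
    private static final Pattern LINK = Pattern.compile("href=\"(https?://[^\"]+)\"");

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        Queue<String> frontier = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        frontier.add("https://example.com/");  // seed URL (placeholder)

        while (!frontier.isEmpty() && visited.size() < 20) {  // small limit for the sketch
            String url = frontier.poll();
            if (!visited.add(url)) continue;  // already seen

            HttpResponse<String> resp = client.send(
                    HttpRequest.newBuilder(URI.create(url)).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
            System.out.println(url + " -> " + resp.statusCode());

            // Discover new URLs in the page and push them onto the frontier.
            Matcher m = LINK.matcher(resp.body());
            while (m.find()) {
                String link = m.group(1);
                if (!visited.contains(link)) frontier.add(link);
            }
        }
    }
}
```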
I've used Nutch extensively when I was building the open source project index for my Krugle startup. It's hard to customize because of its fairly monolithic design. There is a plug-in architecture, but the interaction between plug-ins and the system is tricky and fragile.
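To make the plug-in idea concrete, here is a generic Java sketch of the kind of extension point a pluggable crawler exposes. It is only loosely inspired by Nutch's URL-filter plug-ins; the interface and class names are hypothetical and this is not Nutch's actual API.

```java
import java.util.List;

// Sketch of a plug-in chain: each plug-in may rewrite a URL or veto it before fetching.
public class PluginExample {

    // Hypothetical extension point: return the (possibly rewritten) URL, or null to drop it.
    interface UrlFilter {
        String filter(String url);
    }

    // Example plug-in: keep only URLs from a single host.
    static class SameHostFilter implements UrlFilter {
        public String filter(String url) {
            return url.startsWith("https://example.com/") ? url : null;
        }
    }

    // The crawler core runs every candidate URL through the plug-in chain.
    static String applyFilters(String url, List<UrlFilter> filters) {
        for (UrlFilter f : filters) {
            if (url == null) return null;
            url = f.filter(url);
        }
        return url;
    }

    public static void main(String[] args) {
        List<UrlFilter> chain = List.of(new SameHostFilter());
        System.out.println(applyFilters("https://example.com/page", chain)); // kept
        System.out.println(applyFilters("https://other.org/page", chain));   // null (dropped)
    }
}
```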
As a result of that experience, and needing something with more flexibility, I started the Bixo project - a web mining toolkit. http://openbixo.org.
Whether it's right for you depends on the weighting of factors such as:
A quick search on GitHub turned up Anemone, a web spider framework that seems to fit your requirements, particularly extensibility. It's written in Ruby.
Hope it goes well!