I am looking at writing my own, but I am wondering if there are any good web crawlers out there which are written in Ruby.
Short of a full-blown web crawler, any gems that might be helpful in building a web crawler would be useful. I know this part of the question is touched upon in a couple of places, but a list of gems applicable to building a web crawler would be a great resource as well.
All search engines need to have crawlers, some examples are: Amazonbot is an Amazon web crawler for web content identification and backlink discovery. Baiduspider for Baidu. Bingbot for Bing search engine by Microsoft.
There are two types of crawlers: universal crawlers and topic crawlers [7]. A universal crawler downloads all pages irrespective of their contents, while a topic crawler downloads only pages of certain topics. The difficulty in topic crawling is how to recognize such pages.
I used to write spiders, page scrapers and site analyzers for my job, and still write them periodically to scratch some itch I get.
Ruby has some excellent gems to make it easy:
Nokogiri is my #1 choice for the HTML parser. I used to use Hpricot, but found some sites that made it explode in flames. I switched to Nokogiri afterwards and have been very happy with it. I regularly use it for parsing HTML, RDF/RSS/Atom and XML. Ox looks interesting too, so that might be another candidate, though I find searching the DOM a lot easier than trying to walk through a big hash, such as what is returned by Ox.
OpenURI is good as a simple HTTP client, but it can get in the way when you want to do more complex things or need to have multiple requests firing at once. I'd recommend looking at HTTPClient or Typhoeus with Hydra for modest to heavyweight jobs. Curb is good too, because it uses the cURL library, but the interface isn't as intuitive to me. It's worth looking at though. HTTPclient is also worth looking at, but I lean toward the previously mentioned ones.
Note: OpenURI has some flaws and vulnerabilities that can affect unsuspecting programmers so it's fallen out of favor somewhat. RestClient is a very worthy successor.
You'll need a backing database, and some way to talk to it. This isn't a task for Rails per se, but you could use ActiveRecord, detached from Rails, to talk to the database. I've done that a couple times and it works all right. Instead, I really like Sequel for my ORM. It's very flexible in how it lets you talk to the database, from using straight SQL to using Sequel's ability to programmatically build a query, to modeling the database and using migrations. Once you have the database built, you could use Rails to act as a front-end to the data though.
If you are going to navigate sites in any way beyond simply grabbing pages and following links, you'll want to look at Mechanize. It makes it easy to fill out forms and submit pages. As an added bonus, you can grab the content of a page as a Nokogiri HTML document and parse away using Nokogiri's multitude of tricks.
For massaging/mangling URLs I really like Addressable::URI. It's more full-featured than the built-in URI module. One thing that URI does that's nice is it has the URI#extract method to scan a string for URLs. If that string happened to be the body of a web page it would be an alternate way of locating links, but its downside is you'll also get links to images, videos, ads, etc., and you'll have to filter those out, probably resulting in more work than if you use a parser and look for <a>
tags exclusively. For that matter, Mechanize also has the links
method which returns all the links in a page, but you'll still have to filter them to determine whether you want to follow or ignore them.
If you think you'll need to deal with Javascript manipulated pages, or pages that get their content dynamically from AJAX, you should look into using one of the WATIR variants. There are flavors for the different browsers on different OSes, such as Firewatir, Safariwatir and Operawatir, so you'll have to figure out what works for you.
You do NOT want to rely on keeping your list of URLs to visit, or visited URLs, in memory. Design a database schema and store that information there. Spend some time up front designing the schema, thinking about what things you'll want to know as you collect links on a site. SQLite3, MySQL and Postgres are all excellent choices, depending on how big you think your database needs will be. One of my site analyzers was custom designed to help us recommend SEO changes for a Fortune 50 company. It ran for over three weeks covering about twenty different sites before we had enough data and stopped it. Imagine what would have happened if we had a power-outage and all that data went in the bit-bucket.
After all that you'll want to also make your code be aware of proper spidering etiquette: What are the key considerations when creating a web crawler?
I am building wombat, a Ruby DSL to crawl web pages and extract content. Check it out on github https://github.com/felipecsl/wombat
It is still in an early stage but is already functional with basic functionality. More stuff will be added really soon.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With