In Perl there is an LWP module:
The libwww-perl collection is a set of Perl modules which provides a simple and consistent application programming interface (API) to the World-Wide Web. The main focus of the library is to provide classes and functions that allow you to write WWW clients. The library also contains modules that are of more general use and even classes that help you implement simple HTTP servers.
Is there a similar module (gem) for Ruby?
Update
Here is an example of a function I wrote that extracts URLs from a specific website.
use strict;
use warnings;

use LWP::UserAgent;
use HTML::TreeBuilder 3;

sub get_gallery_urls {
    my $url = shift;

    my $ua = LWP::UserAgent->new;
    $ua->agent("Mozilla/8.0");

    my $req = HTTP::Request->new(GET => $url);
    $req->header('Accept' => 'text/html');

    # send request
    my $response_u = $ua->request($req);
    die "Error: ", $response_u->status_line unless $response_u->is_success;

    my $root = HTML::TreeBuilder->new;
    $root->parse($response_u->content);

    # every element with id="thumbnails"
    my @gu = $root->find_by_attribute("id", "thumbnails");

    my %urls = ();
    foreach my $g (@gu) {
        my @as = $g->find_by_tag_name('a');
        foreach my $a (@as) {
            my $u = $a->attr("href");
            if ($u =~ /^\//) {
                $urls{"http://example.com" . $u} = 1;
            }
        }
    }

    return %urls;
}
The closest match is probably httpclient, which aims to be the equivalent of LWP. However, depending on what you plan to do, there may be better options. If you plan to follow links, fill out forms, etc. in order to scrape web content, you can use Mechanize, which is similar to the Perl module of the same name. There are also more Ruby-specific gems, such as the excellent rest-client and HTTParty (my personal favorite). See the HTTP Clients category of the Ruby Toolbox for a larger list.
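To give a feel for how lightweight these gems are, here is a minimal sketch of a plain GET with HTTParty (the URL is just a placeholder):

require 'httparty'

# Fetch a page and inspect the response (placeholder URL).
response = HTTParty.get('http://example.com/')
puts response.code  # HTTP status code, e.g. 200
puts response.body  # raw response body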
Update: Here's an example of how to find all links on a page in Mechanize (Ruby, but it would be similar in Perl):
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://example.com/')
page.links.each do |link|
  puts link.text
end
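If you want the link targets rather than their text, each Mechanize link also exposes its href; a small variation on the same sketch (same placeholder page) would be:

page.links.each do |link|
  puts link.href  # the href attribute instead of the anchor text
end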
P.S. As an ex-Perler myself, I used to worry about abandoning the excellent CPAN: would I paint myself into a corner with Ruby? Would I be unable to find an equivalent to a module I rely on? This has turned out not to be a problem at all; in fact, lately it has been quite the opposite: Ruby (along with Python) tends to be the first to get client support for new platforms, web services, etc.
Here's what your function might look like in Ruby.
require 'rubygems'
require 'mechanize'

def get_gallery_urls(url)
  ua = Mechanize.new
  ua.user_agent = "Mozilla/8.0"
  urls = {}

  doc = ua.get(url)
  doc.search("#thumbnails a").each do |a|
    u = a["href"]
    urls["http://example.com#{u}"] = 1 if u =~ /^\//
  end

  urls
end
Much nicer :)
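For completeness, calling it would look something like this (the gallery path is just a placeholder; the function returns a hash keyed by the absolute URLs):

urls = get_gallery_urls("http://example.com/gallery")
urls.each_key { |u| puts u }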
I used Perl for years and years, and liked LWP. It was a great tool. However, here's how I'd go about extracting URLs from a page. This isn't spidering a site, but that'd be an easy thing:
require 'open-uri'
require 'uri'
urls = URI.extract(open('http://example.com').read)
puts urls
With the resulting output looking like:
http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
http://www.w3.org/1999/xhtml
http://www.icann.org/
mailto:[email protected]?subject=General%20website%20feedback
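URI.extract also accepts an optional list of schemes, which is handy if you only want HTTP(S) links and not, say, the mailto: entry above; a minimal sketch:

require 'open-uri'
require 'uri'

# Only pull out http/https URLs.
urls = URI.extract(open('http://example.com').read, %w[http https])
puts urls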
Writing that as a method:
require 'open-uri'
require 'uri'
def get_gallery_urls(url)
  URI.extract(open(url).read)
end
or, closer to the original function while doing it the Ruby-way:
def get_gallery_urls(url)
  URI.extract(open(url).read).map { |u|
    URI.parse(u).host ? u : URI.join(url, u).to_s
  }
end
or, following closer to the original code:
require 'nokogiri'
require 'open-uri'
require 'uri'

def get_gallery_urls(url)
  Nokogiri::HTML(open(url))
    .at('#thumbnails')
    .search('a')
    .map { |link|
      href = link['href']
      URI.parse(href).host ? href : URI.join(url, href).to_s
    }
end
One of the things that attracted me to Ruby is its ability to be readable, while still being concise.
If you want to roll your own TCP/IP-based functions, Ruby's standard Net library is the starting point. By default you get:
net/ftp
net/http
net/imap
net/pop
net/smtp
net/telnet

with the SSL-based ssh, scp, sftp and others available as gems. Use

gem search net -r | grep ^net-

to see a short list.
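As a rough sketch of what that looks like, a bare-bones GET with net/http (placeholder URL) is only a few lines:

require 'net/http'
require 'uri'

# Fetch a page using only the standard library (placeholder URL).
uri = URI.parse('http://example.com/')
response = Net::HTTP.get_response(uri)
puts response.code  # status code as a string, e.g. "200"
puts response.body  # response body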
This is more of an answer for anyone looking at this question who needs to know about easier/better/different alternatives to general web scraping with Perl compared to using LWP (or even WWW::Mechanize).

Here is a quick selection of web scraping modules on CPAN:
Mojo::UserAgent
pQuery
Scrappy
Web::Magic
Web::Scraper
Web::Query
NB. The above is just in alphabetical order, so please choose your favourite poison :)
For most of my recent web scraping I've been using pQuery. You can see there are quite a few examples of its usage on SO.

Below is your get_gallery_urls example using pQuery:
use strict;
use warnings;
use pQuery;

sub get_gallery_urls {
    my $url = shift;
    my %urls;

    pQuery($url)
        ->find("#thumbnails a")
        ->each( sub {
            my $u = $_->getAttribute('href');
            $urls{'http://example.com' . $u} = 1 if $u =~ /^\//;
        });

    return %urls;
}
PS. As Daxim has said in the comments, there are plenty of excellent Perl tools for web scraping. The hardest part is just choosing which one to use!