Is there a module like Perl's LWP for Ruby?

In Perl there is an LWP module:

The libwww-perl collection is a set of Perl modules which provides a simple and consistent application programming interface (API) to the World-Wide Web. The main focus of the library is to provide classes and functions that allow you to write WWW clients. The library also contains modules that are of more general use and even classes that help you implement simple HTTP servers.

Is there a similar module (gem) for Ruby?

Update

Here is an example of a function I wrote that extracts URLs from a specific website.

use strict;
use warnings;

use LWP::UserAgent;
use HTTP::Request;
use HTML::TreeBuilder 3;

sub get_gallery_urls {
    my $url = shift;

    # Build the user agent and identify as a browser.
    my $ua = LWP::UserAgent->new;
    $ua->agent("Mozilla/8.0");

    my $req = HTTP::Request->new(GET => $url);
    $req->header('Accept' => 'text/html');

    # Send the request and bail out on failure.
    my $response_u = $ua->request($req);
    die "Error: ", $response_u->status_line unless $response_u->is_success;

    # Parse the returned HTML into a tree.
    my $root = HTML::TreeBuilder->new;
    $root->parse_content($response_u->decoded_content);

    # Collect every link inside the "thumbnails" element and keep
    # the site-relative ones, made absolute.
    my @gu = $root->find_by_attribute("id", "thumbnails");

    my %urls;

    foreach my $g (@gu) {
        my @as = $g->find_by_tag_name('a');

        foreach my $a (@as) {
            my $u = $a->attr("href");

            if (defined $u && $u =~ m{^/}) {
                $urls{"http://example.com$u"} = 1;
            }
        }
    }

    return %urls;
}
asked Nov 25 '11 by Sandra Schlichting


4 Answers

The closest match is probably httpclient, which aims to be the equivalent of LWP. However, depending on what you plan to do, there may be better options. If you plan to follow links, fill out forms, etc. in order to scrape web content, you can use Mechanize, which is similar to the Perl module of the same name. There are also more Ruby-specific gems, such as the excellent rest-client and HTTParty (my personal favorite). See the HTTP Clients category of Ruby Toolbox for a larger list.
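
For example, a plain GET with HTTParty is about as short as it gets (a minimal sketch; the URL is just a placeholder):

require 'httparty'

# Fetch a page and inspect the response.
response = HTTParty.get('http://example.com/')
puts response.code
puts response.body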

Update: Here's an example of how to find all links on a page in Mechanize (Ruby, but it would be similar in Perl):

require 'rubygems'
require 'mechanize'

agent = Mechanize.new

page = agent.get('http://example.com/')

page.links.each do |link|
  puts link.text
end

P.S. As an ex-Perler myself, I used to worry about abandoning the excellent CPAN--would I paint myself into a corner with Ruby? Would I not be able to find an equivalent to a module I rely on? This has turned out not to be a problem at all, and in fact lately has been quite the opposite: Ruby (along with Python) tends to be the first to get client support for new platforms/web services, etc.

answered by Mark Thomas


Here's what your function might look like in Ruby.

require 'rubygems'
require 'mechanize'

def get_gallery_urls(url)
  ua = Mechanize.new
  ua.user_agent = "Mozilla/8.0"
  urls = {}

  doc = ua.get(url)

  # Grab every link inside the "thumbnails" element and keep the
  # site-relative ones, made absolute.
  doc.search("#thumbnails a").each do |a|
    u = a["href"]
    urls["http://example.com#{u}"] = 1 if u =~ /^\//
  end

  urls
end

Much nicer :)
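
And calling it works just like the Perl version, minus the sigils (the gallery URL below is just a placeholder):

urls = get_gallery_urls("http://example.com/gallery")
urls.each_key { |u| puts u }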

answered by pguardiario


I used Perl for years and years, and liked LWP. It was a great tool. However, here's how I'd go about extracting URLs from a page. This isn't spidering a site, but that'd be an easy thing:

require 'open-uri'
require 'uri'

urls = URI.extract(open('http://example.com').read)
puts urls

With the resulting output looking like:

http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
http://www.w3.org/1999/xhtml
http://www.icann.org/
mailto:[email protected]?subject=General%20website%20feedback

Writing that as a method:

require 'open-uri'
require 'uri'

def get_gallery_urls(url)
  URI.extract(open(url).read)
end

or, closer to the original function while doing it the Ruby way:

def get_gallery_urls(url)
  URI.extract(open(url).read).map{ |u| 
    URI.parse(u).host ? u : URI.join(url, u).to_s
  }
end

or, following the original code more closely:

require 'nokogiri'
require 'open-uri'
require 'uri'

def get_gallery_urls(url)
  # Parse the page, find the "thumbnails" element, and absolutize
  # any relative hrefs against the page URL.
  Nokogiri::HTML(open(url))
    .at('#thumbnails')
    .search('a')
    .map { |link|
      href = link['href']
      URI.parse(href).host ? href : URI.join(url, href).to_s
    }
end

One of the things that attracted me to Ruby is its ability to be readable, while still being concise.

If you want to roll your own TCP/IP-based functions, Ruby's standard Net library is the starting point. By default you get:

net/ftp
net/http
net/imap
net/pop
net/smtp
net/telnet

with the SSL-based ssh, scp, sftp and others available as gems. Use gem search net -r | grep ^net- to see a short list.
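
For instance, a bare-bones GET with net/http looks like this (a minimal sketch; the URL is just a placeholder):

require 'net/http'
require 'uri'

uri = URI.parse('http://example.com/')

# Net::HTTP.get_response handles the connection for a simple GET.
response = Net::HTTP.get_response(uri)
puts response.code
puts response.body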

answered by the Tin Man


This is more of an answer for anyone coming across this question and wanting to know what easier/better/different alternatives there are for general web scraping with Perl, compared to using LWP (and even WWW::Mechanize).

Here is a quick selection of web scraping modules on CPAN:

  • Mojo::UserAgent
  • pQuery
  • Scrappy
  • Web::Magic
  • Web::Scraper
  • Web::Query

NB: the list above is just in alphabetical order, so please choose your favourite poison :)

For most of my recent web scraping I've been using pQuery. You can see there are quite a few examples of usage on SO.

Below is your get_gallery_urls example using pQuery:

use strict;
use warnings;
use pQuery;

sub get_gallery_urls {
    my $url = shift;
    my %urls;

    pQuery($url)
        ->find("#thumbnails a")
        ->each( sub {
            my $u = $_->getAttribute('href');
            $urls{'http://example.com' . $u} = 1 if $u =~ /^\//;
        });

    return %urls;
}

PS. As Daxim has said in the comments, there are plenty of excellent Perl tools for web scraping. The hardest part is just choosing which one to use!

answered by draegtun