
Getting all links of a webpage using Ruby

I'm trying to retrieve every external link of a webpage using Ruby. I'm using String.scan with this regex:

/href="https?:[^"]*|href='https?:[^']*/i

Then, I can use gsub to remove the href part:

str.gsub(/href=['"]/)

This works fine, but I'm not sure whether it's efficient in terms of performance. Is this OK to use, or should I work with a proper parser (Nokogiri, for example)? Which way is better?

Thanks!

asked Jul 14 '11 by Fábio Perez


2 Answers

Using regular expressions is fine for a quick and dirty script, but Nokogiri is very simple to use:

require 'nokogiri'
require 'open-uri'

fail("Usage: extract_links URL [URL ...]") if ARGV.empty?

ARGV.each do |url|
  doc = Nokogiri::HTML(open(url))   # open comes from open-uri (use URI.open on Ruby 3+)
  hrefs = doc.css("a").map do |link|
    if (href = link.attr("href")) && !href.empty?
      URI::join(url, href)          # resolve relative hrefs against the page URL
    end
  end.compact.uniq
  STDOUT.puts(hrefs.join("\n"))
end
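To run it, save the script (the name extract_links.rb below is just for illustration) and pass one or more URLs on the command line:

ruby extract_links.rb https://www.ruby-lang.org/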

If you want just the method, refactor it a little bit to your needs:

def get_links(url)
  Nokogiri::HTML(open(url).read).css("a").map do |link|
    if (href = link.attr("href")) && href.match(/^https?:/)
      href
    end
  end.compact
end
answered by tokland


I'm a big fan of Nokogiri, but why reinvent the wheel?

Ruby's URI module already has the extract method to do this:

URI::extract(str[, schemes][,&blk])

From the docs:

Extracts URIs from a string. If block given, iterates through all matched URIs. Returns nil if block given or array with matches.

require "uri"

URI.extract("text here http://foo.example.org/bla and here mailto:[email protected] and here also.")
# => ["http://foo.example.com/bla", "mailto:[email protected]"]

You could use Nokogiri to walk the DOM and pull all the tags that have URLs, or have it retrieve just the text and pass it to URI.extract, or just let URI.extract do it all.
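For instance, here is a rough sketch combining the two approaches (the URL is a placeholder; on Rubies before 2.5 use open-uri's open(...) instead of URI.open):

require 'nokogiri'
require 'open-uri'
require 'uri'

html = URI.open("https://www.ruby-lang.org/").read
doc  = Nokogiri::HTML(html)

# Links written out in the page's visible text, found by URI.extract
text_urls = URI.extract(doc.text, %w[http https])

# Links in href attributes, found by walking the DOM
href_urls = doc.css("a[href]").map { |a| a["href"] }

all_urls = (text_urls + href_urls).uniq
puts all_urls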

And, why use a parser, such as Nokogiri, instead of regex patterns? Because HTML, and XML, can be formatted in a lot of different ways and still render correctly on the page or effectively transfer the data. Browsers are very forgiving when it comes to accepting bad markup. Regex patterns, on the other hand, work in very limited ranges of "acceptability", where that range is defined by how well you anticipate the variations in the markup, or, conversely, how well you anticipate the ways your pattern can go wrong when presented with unexpected patterns.

A parser doesn't work like a regex. It builds an internal representation of the document and then walks through that. It doesn't care how the file/markup is laid out, it does its work on the internal representation of the DOM. Nokogiri relaxes its parsing to handle HTML, because HTML is notorious for being poorly written. That helps us because with most non-validating HTML Nokogiri can fix it up. Occasionally I'll encounter something that is SO badly written that Nokogiri can't fix it correctly, so I'll have to give it a minor nudge by tweaking the HTML before I pass it to Nokogiri; I'll still use the parser though, rather than try to use patterns.
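A tiny, contrived illustration of that forgiveness (the markup and URLs are made up for the example):

require 'nokogiri'

# Unquoted attribute, unclosed tags -- fragile territory for a regex
messy = '<p>see <a href=http://example.com/a>first <a href="http://example.com/b">second'

doc = Nokogiri::HTML(messy)
puts doc.css("a").map { |a| a["href"] }
# typically prints both http://example.com/a and http://example.com/b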

answered by the Tin Man