(Update: This answer is written from the point of view of Nokogiri, but it's also useful if you're looking for the XPATH or CSS syntax for specific queries.)
I love Nokogiri -- it's a wonderful tool for extracting elements from XML and HTML documents. Although the online examples are good, they mostly show how to manipulate XML documents.
How can you extract extract links and link text from HTML using Nokogiri?
This is a mini-example originally written in response to Getting attribute's value in Nokogiri to extract link URLs, extracted here in Community Wiki style for easy reference.
Here are some common operations you might do when parsing links in HTTP, shown both in css
and xpath
syntax.
Starting with with this snippet:
require 'rubygems'
require 'nokogiri'
html = <<HTML
<div id="block1">
<a href="http://google.com">link1</a>
</div>
<div id="block2">
<a href="http://stackoverflow.com">link2</a>
<a id="tips">just a bookmark</a>
</div>
HTML
doc = Nokogiri::HTML(html)
We can use xpath or css to find all the <a>
elements and then keep only the ones that have an href
attribute:
nodeset = doc.xpath('//a') # Get all anchors via xpath
nodeset.map {|element| element["href"]}.compact # => ["http://google.com", "http://stackoverflow.com"]
nodeset = doc.css('a') # Get all anchors via css
nodeset.map {|element| element["href"]}.compact # => ["http://google.com", "http://stackoverflow.com"]
In the above cases, the .compact
is necessary because the search for the <a>
element returns the "just a bookmark" element in addition to the others.
But we can use a more refined search to find just the elements that contain an href
attribute:
attrs = doc.xpath('//a/@href') # Get anchors w href attribute via xpath
attrs.map {|attr| attr.value} # => ["http://google.com", "http://stackoverflow.com"]
nodeset = doc.css('a[href]') # Get anchors w href attribute via css
nodeset.map {|element| element["href"]} # => ["http://google.com", "http://stackoverflow.com"]
To find a link within the <div id="block2">
nodeset = doc.xpath('//div[@id="block2"]/a/@href')
nodeset.first.value # => "http://stackoverflow.com"
nodeset = doc.css('div#block2 a[href]')
nodeset.first['href'] # => "http://stackoverflow.com"
If you know you're searching for just one link, you can use at_xpath
or at_css
instead:
attr = doc.at_xpath('//div[@id="block2"]/a/@href')
attr.value # => "http://stackoverflow.com"
element = doc.at_css('div#block2 a[href]')
element['href'] # => "http://stackoverflow.com"
What if you know the text associated with a link and want to find its url? A little xpath-fu (or css-fu) comes in handy:
element = doc.at_xpath('//a[text()="link2"]')
element["href"] # => "http://stackoverflow.com"
element = doc.at_css('a:contains("link2")')
element["href"] # => "http://stackoverflow.com"
For completeness, here's how you'd get the text associated with a particular link:
element = doc.at_xpath('//a[@href="http://stackoverflow.com"]')
element.text # => "link2"
element = doc.at_css('a[href="http://stackoverflow.com"]')
element.text # => "link2"
In addition to the extensive Nokorigi documentation, I came across some useful links while writing this up:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With