
XPath along with nokogiri; tutorials/examples? [closed]

I am new to XPath and find it a bit tricky; sometimes it does not work the way I think it should.

When I scrape data from a website using XPath and Nokogiri, I find it difficult if the website has a complex structure. I use FirePath to get the XPath of an element but sometimes it does not seem to work. I have to remove extra tags added by the browser, like tbody.

I really want to know if there are some good tutorials and examples of XPath and Nokogiri. I could not find much after a Google search.

K M Rakibul Islam asked Oct 25 '12 14:10



2 Answers

The biggest trick to finding an element, or group of elements, using Nokogiri or any XML/HTML parser, is to start with a short accessor to get into the general vicinity of what you're looking for, then iteratively add to it, fine-tuning as you go, until you have what you want.

The second trick is to remember to use // to start your XPath, not /, unless you're absolutely sure you want to start at the root of the document. // is like a '**/*' wildcard at the command-line in Linux. It searches everywhere.

Also, don't trust the XPath or CSS accessor provided by a browser. Browsers do all sorts of fixups to the HTML source, including inserting tbody tags, like you saw. Instead, use Ruby's OpenURI, curl or wget to retrieve the raw source, and look at it with an editor like vi or vim, or use less or cat to print it to the screen. That way you see the file exactly as the server sent it.

Finally, it's often easier/faster to break the search into chunks with XPath, then let Ruby iterate through things, than to try to come up with a complex XPath that's harder to maintain or more fragile.

Nokogiri itself is pretty easy. The majority of things you'll want to do are simple combinations of two different methods: search and at. Both take either a CSS or XPath selector. search, along with its sibling methods xpath and css, returns a NodeSet, which is basically an array of nodes that you can iterate over. at, at_css and at_xpath return the first node that matches the CSS or XPath accessor, or nil if nothing matches. In all those methods, the ...xpath variants accept an XPath, and the ...css ones take a CSS accessor.

Once you have a node, generally you'll want to do one of two things with it: extract an attribute or get its text/content. You can get an attribute using node['attribute_name'] and the text using node.text.

Using those methods we can search for all the links in a page and return their text and related href, using something like:

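require 'awesome_print'
require 'nokogiri'
require 'open-uri'

# URI.open is required on Ruby 3.0+, where Kernel#open no longer accepts URLs
doc = Nokogiri::HTML(URI.open('http://www.example.com'))
# Collect [href, text] for every link and pretty-print the first five
ap doc.search('a').map { |a| [a['href'], a.text] }[0, 5]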

Which, at the time of writing, output:

[
    [0] [
        [0] "/",
        [1] ""
    ],
    [1] [
        [0] "/domains/",
        [1] "Domains"
    ],
    [2] [
        [0] "/numbers/",
        [1] "Numbers"
    ],
    [3] [
        [0] "/protocols/",
        [1] "Protocols"
    ],
    [4] [
        [0] "/about/",
        [1] "About IANA"
    ]
]
the Tin Man answered Nov 07 '22 04:11



I also found that there was a pretty steep learning curve using Nokogiri and XPath at the beginning, but after a lot of trial and error I've now managed to get the hang of both, so hang in there! Nokogiri is really powerful and well worth learning.

Regarding tutorials/examples, I assume you've seen the Nokogiri tutorials page. I can imagine that the level of those tutorials might be a bit high if you're not used to XPath, XML parsing etc.

Some other possible resources:

  • Getting Started with Nokogiri
  • Getting Started with Nokogiri and XML in Ruby
  • How do I use XPath in Nokogiri?

On XPath, I'd suggest reading this summary in five paragraphs. At its core XPath is fairly simple, just really unintuitive! I find CSS much easier to remember, and I don't think I'm the only one.

But in the end, while tutorials will help, the best thing you can do is to just crack open a console, require 'nokogiri' and start plugging away. After a while it will just start making sense.

Chris Salzberg answered Nov 07 '22 05:11
