I am new to XPath and it seems a bit tricky to me; sometimes I find it does not work the way I think it should.
When I scrape data from a website using XPath and Nokogiri, I find it difficult if the website has a complex structure. I use FirePath to get the XPath of an element, but sometimes it does not seem to work: I have to remove extra tags added by the browser, like `tbody`.

I really want to know if there are good tutorials and examples of XPath and Nokogiri. I could not find much with a Google search.
The biggest trick to finding an element, or group of elements, using Nokogiri or any XML/HTML parser, is to start with a short accessor to get into the general vicinity of what you're looking for, then iteratively add to it, fine-tuning as you go, until you have what you want.
The second trick is to remember to use `//` to start your XPath, not `/`, unless you're absolutely sure you want to start at the root of the document. `//` is like a `**/*` wildcard at the command line in Linux: it searches everywhere.
Also, don't trust the XPath or CSS accessor provided by a browser. Browsers do all sorts of fixups to the HTML source, including adding `tbody`, like you saw. Instead, use Ruby's OpenURI, or `curl` or `wget`, to retrieve the raw source, and look at it with an editor like `vi` or `vim`, or `less` or `cat` it to the screen. That way there's no chance of the file having been changed.
Finally, it's often easier/faster to break the search into chunks with XPath, then let Ruby iterate through things, than to try to come up with a complex XPath that's harder to maintain or more fragile.
Nokogiri itself is pretty easy. The majority of things you'll want to do are simple combinations of two methods: `search` and `at`. Both take either a CSS or an XPath selector. `search`, along with its sibling methods `xpath` and `css`, returns a `NodeSet`, which is basically an array of nodes you can iterate over. `at`, `at_css` and `at_xpath` return the first node that matches the selector. In all these methods, the `...xpath` variants accept an XPath and the `...css` ones take a CSS accessor.
Once you have a node, generally you'll want to do one of two things with it: extract an attribute or get its text/content. You can get an attribute using `node['attribute_to_get']` and the text using `node.text`.
Using those methods we can search for all the links in a page and return their text and related href, using something like:
```ruby
require 'awesome_print'
require 'nokogiri'
require 'open-uri'

# On Ruby 3.0 and later use URI.open; older Rubies let open-uri patch Kernel#open.
doc = Nokogiri::HTML(URI.open('http://www.example.com'))
ap doc.search('a').map { |a| [a['href'], a.text] }[0, 5]
```
Which outputs:
```
[
    [0] [
        [0] "/",
        [1] ""
    ],
    [1] [
        [0] "/domains/",
        [1] "Domains"
    ],
    [2] [
        [0] "/numbers/",
        [1] "Numbers"
    ],
    [3] [
        [0] "/protocols/",
        [1] "Protocols"
    ],
    [4] [
        [0] "/about/",
        [1] "About IANA"
    ]
]
```
I also found that there was a pretty steep learning curve using Nokogiri and XPath at the beginning, but after a lot of trial and error I've now managed to get the hang of both, so hang in there! Nokogiri is really powerful and well worth learning.
Regarding tutorials/examples, I assume you've seen the Nokogiri tutorials page. I can imagine that the level of those tutorials might be a bit high if you're not used to XPath, XML parsing etc.
Some other possible resources:
On XPath, I'd suggest reading this summary in five paragraphs. At its core XPath is fairly simple, just really unintuitive! I find CSS much easier to remember, and I don't think I'm the only one.
But in the end, while tutorials will help, the best thing you can do is just crack open a console, `require 'nokogiri'`, and start plugging away. After a while it will just start making sense.