
XPath or CSS parsing faster (for Nokogiri on HTML files)?

I would like to know whether XPath or CSS queries run faster in Nokogiri on HTML files. How much do the speeds differ?

TonyTakeshi asked Nov 21 '11 15:11



1 Answer

Nokogiri doesn't have separate XPath or CSS parsing. It parses XML/HTML into a single DOM, which you can then query using either CSS or XPath syntax.

CSS selectors are internally turned into XPath before asking libxml2 to perform the query. As such (for the exact same selectors) the XPath version would be a tiny fraction faster, since the CSS does not need to be converted into XPath first.

However, your question has no general answer; it depends on what you are selecting, and what your XPath looks like. Chances are, you wouldn't be writing the same XPath as Nokogiri creates. For example, see if you can guess the XPath for the following two CSS selectors:

require 'nokogiri'

# Exact output formatting varies between Nokogiri versions.
puts Nokogiri::CSS.xpath_for('#foo')
#=> //*[@id = 'foo']

puts Nokogiri::CSS.xpath_for('div.article a.external')
#=> //div[contains(concat(' ', @class, ' '), ' article ')]//a[contains(concat(' ', @class, ' '), ' external ')]

Unlike in a Web browser, there is no sped-up cache for id and class attributes, so selecting on them gives no speed advantage. Indeed, the general interpretation of div.article involves far more work than something like div[@class='article'], which matches only when the class attribute is exactly "article".
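
The difference between the two tests can be sketched in plain Ruby, without Nokogiri. The method names below are hypothetical; they only mimic what each XPath predicate computes on a class attribute string.

```ruby
# Mimics contains(concat(' ', @class, ' '), ' name '):
# pad the attribute with spaces, then search for the space-padded name.
def class_token_match?(class_attr, name)
  " #{class_attr} ".include?(" #{name} ")
end

# Mimics @class='name': a single plain string comparison.
def exact_class_match?(class_attr, name)
  class_attr == name
end

['article', 'article featured', 'articles'].each do |attr|
  puts format('%-18s token=%-5s exact=%s', attr.inspect,
              class_token_match?(attr, 'article'),
              exact_class_match?(attr, 'article'))
end
```

The token test accepts "article featured" where the exact comparison does not, so the cheaper XPath is only a safe substitute when you know the markup uses a single class on those elements.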

As @LBg commented, you should benchmark for yourself if absolute speed is critical.

However, I would suggest this: don't worry about it. Computers are fast. Write what is most convenient for you, the programmer. If a CSS selector is easier to craft, faster to type, and easier to understand when reviewing your code later, use that. Use XPath when you need to do things that you cannot do with the CSS selector syntax.

How long does it take Nokogiri to convert a reasonably complex CSS selector to XPath?

require 'nokogiri'

t = Time.now
1000.times do |i|
  # Use a different CSS string each time to avoid built-in caching
  css = "body#foo table#bar#{i} thead th, body#foo table#bar#{i} tbody td"
  Nokogiri::CSS.xpath_for(css)
end
puts((Time.now - t) / 1000) # average seconds per conversion
#=> 0.000405041

Less than half a millisecond per conversion.

Phrogz answered Oct 09 '22 01:10
