
What are fast XML parsers for Ruby? [closed]

Tags: parsing, xml, ruby

I am using Nokogiri, which works well for small documents. But for a 180 KB HTML file I have to increase the process stack size via ulimit -s, and both parsing and XPath queries take a long time.

Are there faster methods available using a stock Ruby distribution?

I am getting used to XPath, but the solution does not necessarily need to support XPath.

The criteria are:

  1. Fast to write.
  2. Fast execution.
  3. Robust resulting parser.

maxschlepzig asked Oct 27 '10 18:10




2 Answers

Check out the Ox gem. It is faster than LibXML and Nokogiri and supports in-memory parsing as well as SAX callback parsing. Full disclosure: I wrote it.

The performance comparison at http://www.ohler.com/software/thoughts/Blog/Entries/2011/9/21_XML_with_Ruby.html covers both DOM (in-memory) and SAX (callback) parsers.

Peter Ohler answered Oct 23 '22 03:10


Nokogiri is based on libxml2, which is one of the fastest XML/HTML parsers in any language. It is written in C, but there are bindings in many languages.

The problem is that the more complex the file, the longer it takes to build a complete DOM structure in memory. Creating a DOM is slower and more memory-hungry than other parsing methods (generally the entire DOM must fit into memory). XPath relies on this DOM.
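To make the DOM-then-XPath model concrete, here is a minimal sketch. It uses REXML rather than Nokogiri, only because REXML ships with most Ruby installs (it is a bundled gem since Ruby 3.0); the sample document and variable names are made up for illustration. REXML is not a fast parser, so this shows the shape of the approach, not a performance recommendation.

```ruby
require 'rexml/document'

# Hypothetical sample input, for illustration only.
xml = <<~XML
  <library>
    <book id="1"><title>Dune</title></book>
    <book id="2"><title>Emma</title></book>
  </library>
XML

# Building the Document parses the whole input into an in-memory tree.
doc = REXML::Document.new(xml)

# XPath queries then run against that tree, not against the raw text.
titles = REXML::XPath.match(doc, '//book/title').map(&:text)
second = REXML::XPath.first(doc, '//book[@id="2"]/title').text

puts titles.inspect
puts second
```

The whole tree exists before the first query runs, which is why cost grows with document size even when only a few nodes are needed.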

SAX is often what people turn to for speed or for large documents that don't fit into memory. It is event-driven: the parser notifies you of a start element, end element, etc., and you write handlers to react to them. It's a bit of a pain because you end up keeping track of state yourself (e.g. which elements you're "inside").
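The state tracking described above might look like the following sketch. It uses REXML's stream listener as a stand-in for a SAX-style API, since that is available without installing extra gems on most Ruby installs; Nokogiri's SAX interface follows the same handler pattern with different class and method names. The TitleCollector class and the sample XML are hypothetical.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical handler: collects the text of every <title> element.
class TitleCollector
  include REXML::StreamListener

  attr_reader :titles

  def initialize
    @path = []    # stack of element names we are currently "inside"
    @titles = []
  end

  # Called for each opening tag.
  def tag_start(name, _attrs)
    @path.push(name)
  end

  # Called for each closing tag.
  def tag_end(_name)
    @path.pop
  end

  # Called for each run of character data.
  def text(content)
    stripped = content.strip
    @titles << stripped if @path.last == 'title' && !stripped.empty?
  end
end

xml = <<~XML
  <library>
    <book><title>Dune</title><author>Herbert</author></book>
    <book><title>Emma</title><author>Austen</author></book>
  </library>
XML

listener = TitleCollector.new
REXML::Document.parse_stream(xml, listener)
puts listener.titles.inspect
```

No tree is built here; the only memory held is the element stack and whatever the handler chooses to keep.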

There is a middle ground: some parsers have a "pull parsing" capability where you have a cursor-like navigation. You still visit each node sequentially, but you can "fast-forward" to the end of an element you're not interested in. It's got the speed of SAX but a better interface for many uses. I don't know if Nokogiri can do this for HTML, but I'd look into its Reader API if you're interested.
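As a rough illustration of the cursor style, the sketch below uses REXML's pull parser, again chosen only because it needs no extra gems; it is not Nokogiri's Reader API. The skip_subtree helper and the sample markup are hypothetical, and the input is assumed to be well-formed XML, not arbitrary HTML.

```ruby
require 'rexml/parsers/pullparser'

# Hypothetical helper: consume events until the element whose start tag
# was just read has been closed.
def skip_subtree(parser)
  depth = 1
  while depth > 0 && parser.has_next?
    event = parser.pull
    depth += 1 if event.start_element?
    depth -= 1 if event.end_element?
  end
end

xml = <<~XML
  <page>
    <nav><a>Home</a><a>About</a></nav>
    <article><p>First</p><p>Second</p></article>
  </page>
XML

parser = REXML::Parsers::PullParser.new(xml)
paragraphs = []
in_paragraph = false

# The caller drives the loop and asks for the next event when ready.
while parser.has_next?
  event = parser.pull
  if event.start_element?
    if event[0] == 'nav'
      skip_subtree(parser)   # not interested in navigation links
    elsif event[0] == 'p'
      in_paragraph = true
    end
  elsif event.end_element?
    in_paragraph = false if event[0] == 'p'
  elsif event.text? && in_paragraph
    paragraphs << event[0]
  end
end

puts paragraphs.inspect
```

In this sketch the skipped events are still tokenized by the parser; the gain is that the main loop never has to reason about them.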

Note that Nokogiri is also very lenient with malformed markup (such as real-world HTML) and this alone makes it a very good choice for HTML parsing.

Mark Thomas answered Oct 23 '22 02:10