
How do I use Nokogiri::XML::Reader to parse large XML files?

I'm trying to use Ruby's Nokogiri to parse large (1 GB or more) XML files. I'm testing code on a smaller file, containing only 4 records available here. I'm using Nokogiri version 1.5.0, Ruby 1.8.7 on Ubuntu 10.10. Since I don't understand SAX very well, I'm trying Nokogiri::XML::Reader to start.

My first attempt, to retrieve the content of the PMID tag, looks like this:

#!/usr/bin/ruby
require "rubygems"
require "nokogiri"

file   = ARGV[0]
reader = Nokogiri::XML::Reader(File.open(file))
p      = []
reader.each do |node|
  if node.name == "PMID"
    p << node.inner_xml
  end
end

puts p.inspect

Here's what I hoped to see:

["21714156", "21693734", "21692271", "21692260"]

Here's what I actually saw:

["21714156", "", "21693734", "", "21692271", "", "21692260", ""]

It seems that for some reason, my code is finding, or generating, an extra, empty PMID tag for every instance of PMID. Either that or inner_xml does not work as I thought.

I'd be grateful if anyone could confirm that my code and data generate the result shown and suggest where I'm going wrong.

neilfws asked Jul 13 '11 06:07


1 Answer

Each element in the stream comes through as two events: one to open the element and one to close it. The opening event will have

node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT

and the closing event will have

node.node_type == Nokogiri::XML::Reader::TYPE_END_ELEMENT

The empty strings you're seeing are just the element closing events. Remember that with a streaming parser such as Reader (as with SAX), you're basically walking through a tree, so you need the second event to tell you when you're going back up and closing an element.

You probably want something more like this:

reader.each do |node|
  if node.name == "PMID" && node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
    p << node.inner_xml
  end
end

Or perhaps:

reader.each do |node|
  next if node.name      != 'PMID'
  next if node.node_type != Nokogiri::XML::Reader::TYPE_ELEMENT
  p << node.inner_xml
end

Or some other variation on that.

mu is too short answered Oct 06 '22 00:10