I have a very large XML file (300 MB) of the following format:
<data>
<point>
<id><![CDATA[1371308]]></id>
<time><![CDATA[15:36]]></time>
</point>
<point>
<id><![CDATA[1371308]]></id>
<time><![CDATA[15:36]]></time>
</point>
<point>
<id><![CDATA[1371308]]></id>
<time><![CDATA[15:36]]></time>
</point>
</data>
Now I need to read it and iterate through the point nodes, doing something for each. Currently I'm doing it with Nokogiri like this:
require 'nokogiri'
xmlfeed = Nokogiri::XML(open("large_file.xml"))
xmlfeed.xpath("./data/point").each do |item|
  save_id(item.xpath("./id").text)
end
However, that's not very efficient, since it parses the whole file in one go, creating a huge memory footprint (several GB).
Is there a way to do this in chunks instead? Might be called streaming if I'm not mistaken?
EDIT
The suggested answer using Nokogiri's SAX parser might be okay, but it gets very messy when there are several nodes within each point that I need to extract content from and process differently. Instead of returning a huge array of entries for later processing, I would much prefer to access one point at a time, process it, and then move on to the next, "forgetting" the previous one.
Given this little-known (but AWESOME) gist using Nokogiri's Reader interface, you should be able to do this:
Xml::Parser.new(Nokogiri::XML::Reader(open(file))) do
  inside_element 'point' do
    for_element 'id' do puts "ID: #{inner_xml}" end
    for_element 'time' do puts "Time: #{inner_xml}" end
  end
end
Someone should make this a gem, perhaps me ;)
Use Nokogiri::XML::SAX::Parser (event-driven parser) and Nokogiri::XML::SAX::Document:
require 'nokogiri'

class IDCollector < Nokogiri::XML::SAX::Document
  attr_reader :ids

  def initialize
    super
    @ids = []
    @inside_id = false
  end

  def start_element(name, attrs = [])
    # NOTE: This is simplified. You need some kind of stack manipulations
    # (push in start_element / pop in end_element)
    # to correctly pick `.//data/point/id` elements.
    @inside_id = true if name == 'id'
  end

  def end_element(name)
    @inside_id = false if name == 'id'
  end

  # CDATA sections are reported here, not through `characters`.
  def cdata_block(string)
    @ids << string if @inside_id
  end
end

collector = IDCollector.new
parser = Nokogiri::XML::SAX::Parser.new(collector)
parser.parse(File.open('large_file.xml'))
p collector.ids # => ["1371308", "1371308", "1371308"]
According to the documentation, Nokogiri::XML::SAX::Parser is a SAX-style parser that reads its input as it deems necessary.
You can also use Nokogiri::XML::SAX::PushParser if you need more control over the file input.