Parse/read Large XML file with minimal memory footprint

I have a very large XML file (300 MB) in the following format:

<data>
 <point>
  <id><![CDATA[1371308]]></id>
  <time><![CDATA[15:36]]></time>
 </point>
 <point>
  <id><![CDATA[1371308]]></id>
  <time><![CDATA[15:36]]></time>
 </point>
 <point>
  <id><![CDATA[1371308]]></id>
  <time><![CDATA[15:36]]></time>
 </point>
</data>

Now I need to read it and iterate through the point nodes doing something for each. Currently I'm doing it with Nokogiri like this:

require 'nokogiri'
xmlfeed = Nokogiri::XML(open("large_file.xml"))
xmlfeed.xpath("./data/point").each do |item|
  save_id(item.xpath("./id").text)
end

However, that's not very efficient: it parses the whole file up front and builds the entire document tree in memory, which creates a huge memory footprint (several GB).

Is there a way to do this in chunks instead? I believe this is called streaming, if I'm not mistaken.

EDIT

The suggested answer using Nokogiri's SAX parser might be okay, but it gets very messy when there are several nodes within each point that I need to extract content from and process differently. Instead of building a huge array of entries for later processing, I would much prefer to access one point at a time, process it, and then move on to the next, "forgetting" the previous one.

Niels Kristian asked Jan 16 '14 14:01



2 Answers

Given this little-known (but AWESOME) gist using Nokogiri's Reader interface, you should be able to do this:

Xml::Parser.new(Nokogiri::XML::Reader(open(file))) do
  inside_element 'point' do
    for_element 'id' do puts "ID: #{inner_xml}" end
    for_element 'time' do puts "Time: #{inner_xml}" end
  end
end

Someone should make this a gem, perhaps me ;)

Mark Thomas answered Nov 14 '22 20:11



Use Nokogiri::XML::SAX::Parser (event-driven parser) and Nokogiri::XML::SAX::Document:

require 'nokogiri'

# Collects the CDATA content of every <id> element as the parser streams by.
class IDCollector < Nokogiri::XML::SAX::Document
  attr_reader :ids

  def initialize
    @ids = []
    @inside_id = false
  end

  def start_element(name, attrs = [])
    # NOTE: This is simplified. You need some kind of stack manipulation
    #       (push in start_element / pop in end_element)
    #       to correctly pick only `.//data/point/id` elements.
    @inside_id = true if name == 'id'
  end

  def end_element(name)
    # Only leave the "inside id" state when the <id> element itself closes.
    @inside_id = false if name == 'id'
  end

  def cdata_block(string)
    @ids << string if @inside_id
  end
end

collector = IDCollector.new
parser = Nokogiri::XML::SAX::Parser.new(collector)
parser.parse(File.open('large_file.xml'))
p collector.ids # => ["1371308", "1371308", "1371308"]

According to the documentation,

Nokogiri::XML::SAX::Parser is a SAX style parser that reads its input as it deems necessary.

You can also use Nokogiri::XML::SAX::PushParser if you need more control over the file input.

falsetru answered Nov 14 '22 22:11
