Parsing large file with SaxMachine seems to be loading the whole file into memory

I have a 1.6 GB XML file, and when I parse it with SaxMachine it does not seem to be streaming or consuming the file in chunks. Rather, it appears to load the whole file into memory (or maybe there is a memory leak somewhere?), because my Ruby process climbs upward of 2.5 GB of RAM. I don't know where it stops growing because I ran out of memory.

On a smaller file (50 MB) it also appears to load the whole file. My task iterates over the records in the XML file and saves each record to a database. It sits "idling" for about 30 seconds, and then all of a sudden the database queries start executing.

I thought SAX was supposed to allow you to work with large files like this without loading the whole thing in memory.

Is there something I am overlooking?

Many thanks

Update to add code sample

class FeedImporter

  class FeedListing
    include ::SAXMachine

    element :id
    element :title
    element :description
    element :url

    def to_hash
      {}.tap do |hash|
        self.class.column_names.each do |key|
          hash[key] = send(key)
        end
      end
    end
  end

  class Feed
    include ::SAXMachine
    elements :listing, :as => :listings, :class => FeedListing
  end

  def perform
    open('~/feeds/large_feed.xml') do |file|

      # I think SAXMachine is trying to load all of the listing elements into this one Ruby object.
      puts 'Parsing'
      feed = Feed.parse(file)

      # We are now iterating over each of the listing elements, but they have been "parsed" from the feed already.
      puts 'Importing'
      feed.listings.each do |listing|
        Listing.import(listing.to_hash)
      end

    end
  end

end

As you can see, I don't care about the <listings> element in the feed. I just want the attributes of each <listing> element.

The output looks like this:

Parsing
... wait forever
Importing   (actually, I never see this on the big file (1.6 GB) because too much memory is used :( )
Asked Feb 08 '12 by jakeonrails


3 Answers

Here's a Nokogiri::XML::Reader loop that pulls out each listing's XML as it is encountered, so you can process each Listing without loading the entire document into memory:

reader = Nokogiri::XML::Reader(file)
while reader.read
  if reader.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT and reader.name == 'listing'
    listing = FeedListing.parse(reader.outer_xml)
    Listing.import(listing.to_hash)
  end
end
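
For context, here is one way that loop might be wired into the original perform method. This is only a sketch: it reuses the FeedListing and Listing classes from the question, and it adds File.expand_path because Kernel#open does not expand ~ on its own.

require 'nokogiri'
require 'sax-machine'

def perform
  # Expand the ~ ourselves; Kernel#open will not do it.
  File.open(File.expand_path('~/feeds/large_feed.xml')) do |file|
    reader = Nokogiri::XML::Reader(file)
    while reader.read
      next unless reader.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
      next unless reader.name == 'listing'

      # Only this one <listing> subtree is materialized at a time.
      listing = FeedListing.parse(reader.outer_xml)
      Listing.import(listing.to_hash)
    end
  end
end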

If listing elements could be nested, and you wanted to parse the outermost listings as single documents, you could do this:

require 'rubygems'
require 'nokogiri'


# Monkey-patch Nokogiri to make this easier
class Nokogiri::XML::Reader
  def element?
    node_type == TYPE_ELEMENT
  end

  def end_element?
    node_type == TYPE_END_ELEMENT
  end

  def opens?(name)
    element? && self.name == name
  end

  def closes?(name)
    (end_element? && self.name == name) || 
      (self_closing? && opens?(name))
  end

  def skip_until_close
    raise "node must be TYPE_ELEMENT" unless element?
    name_to_close = self.name

    if self_closing?
      # DONE!
    else
      level = 1
      while read
        level += 1 if opens?(name_to_close)
        level -= 1 if closes?(name_to_close)

        return if level == 0
      end
    end
  end

  def each_outer_xml(name, &block)
    while read
      if opens?(name)
        yield(outer_xml)
        skip_until_close
      end
    end
  end

end

Once you have it monkey-patched, it's easy to deal with each listing individually:

open(File.expand_path('~/feeds/large_feed.xml')) do |file| # expand the ~, which Kernel#open won't do
  reader = Nokogiri::XML::Reader(file)
  reader.each_outer_xml('listing') do |outer_xml|

    listing = FeedListing.parse(outer_xml)
    Listing.import(listing.to_hash)

  end
end
Answered by John Douthat


Unfortunately there are now three different repos for sax-machine. And worse, the gemspec version was not bumped.

Despite the comment on Greg Weber's blog, I don't think this code was integrated into pauldix's or ezkl's forks. To use the lazy, fiber-based version of the code, I think you need to reference gregwebs' fork specifically in your Gemfile, like this:

gem 'sax-machine', :git => 'https://github.com/gregwebs/sax-machine'
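
For completeness, a minimal Gemfile along those lines might look like the sketch below; the explicit nokogiri line and the unpinned ref are assumptions, and you may want to pin a :ref or :branch for reproducible builds.

# Gemfile
source 'https://rubygems.org'

gem 'nokogiri'                     # assumption: declared explicitly, though sax-machine pulls it in
gem 'sax-machine', :git => 'https://github.com/gregwebs/sax-machine'

Then run bundle install so Bundler fetches sax-machine from that git repo instead of the released gem.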
Answered by George Armhold


I forked sax-machine so that it uses constant memory: https://github.com/gregwebs/sax-machine

Good news: there is a new maintainer who is planning on merging my changes. The new maintainer and I have been using my fork without issue for a year now.

Answered by Greg Weber