I have a large XML file (about 10K rows) I need to parse regularly that is in this format:
<summarysection>
<totalcount>10000</totalcount>
</summarysection>
<items>
<item>
<cat>Category</cat>
<name>Name 1</name>
<value>Val 1</value>
</item>
...... 10,000 more times
</items>
What I'd like to do is parse each of the individual nodes using Nokogiri to count the number of items in one category. Then, I'd like to subtract that number from the total_count to get an output that reads "Count of Interest_Category: n, Count of All Else: z".
This is my code now:
#!/usr/bin/ruby
require 'rubygems'
require 'nokogiri'
require 'open-uri'
icount = 0
xmlfeed = Nokogiri::XML(open("/path/to/file/all.xml"))
all_items = xmlfeed.xpath("//items")
all_items.each do |adv|
if (adv.children.filter("cat").first.child.inner_text.include? "partofcatname")
icount = icount + 1
end
end
othercount = xmlfeed.xpath("//totalcount").inner_text.to_i - icount
puts icount
puts othercount
This seems to work, but is very slow! I'm talking more than 10 minutes for 10,000 items. Is there a better way to do this? Am I doing something in a less than optimal fashion?
Depending on your environment, Oga may also be worth a look: a fast-enough XML parser for Ruby with a friendlier interface and a much faster installation time.
Nokogiri is based on libxml2, which is one of the fastest XML/HTML parsers in any language. It is written in C, but there are bindings in many languages. The problem is that the more complex the file, the longer it takes to build a complete DOM structure in memory.
Here's an example comparing a SAX-based count with a DOM-based count, counting 500,000 <item>s spread across seven categories. First, the output:
Create XML file: 1.7s
Count via SAX: 12.9s
Create DOM: 1.6s
Count via DOM: 2.5s
Both techniques produce the same hash counting the number of each category seen:
{"Cats"=>71423, "Llamas"=>71290, "Pigs"=>71730, "Sheep"=>71491, "Dogs"=>71331, "Cows"=>71536, "Hogs"=>71199}
The SAX version takes 12.9s to count and categorize, while the DOM version takes only 1.6s to create the DOM elements and 2.5s more to find and categorize all the <cat> values. The DOM version is around 3x as fast!
…but that's not the entire story. We have to look at RAM usage as well.
I had enough memory on my machine to handle 1,000,000 items, but at 2,000,000 I ran out of RAM and had to start using virtual memory. Even with an SSD and a fast machine I let the DOM code run for almost ten minutes before finally killing it.
It is very likely that the long times you are reporting are because you are running out of RAM and hitting the disk continuously as part of virtual memory. If you can fit the DOM into memory, use it, as it is FAST. If you can't, however, you really have to use the SAX version.
Here's the test code:
require 'nokogiri'

CATEGORIES = %w[ Cats Dogs Hogs Cows Sheep Pigs Llamas ]
ITEM_COUNT = 500_000

def test!
  create_xml
  sleep 2; GC.start # Time to read memory before cleaning the slate
  test_sax
  sleep 2; GC.start # Time to read memory before cleaning the slate
  test_dom
end

def time(label)
  t1 = Time.now
  yield.tap{ puts "%s: %.1fs" % [ label, Time.now-t1 ] }
end

def test_sax
  item_counts = time("Count via SAX") do
    counter = CategoryCounter.new
    # Use parse_file so we can stream data from disk instead of flooding RAM
    Nokogiri::HTML::SAX::Parser.new(counter).parse_file('tmp.xml')
    counter.category_counts
  end
  # p item_counts
end

def test_dom
  doc = time("Create DOM"){ File.open('tmp.xml','r'){ |f| Nokogiri.XML(f) } }
  counts = time("Count via DOM") do
    counts = Hash.new(0)
    doc.xpath('//cat').each do |cat|
      counts[cat.children[0].content] += 1
    end
    counts
  end
  # p counts
end

class CategoryCounter < Nokogiri::XML::SAX::Document
  attr_reader :category_counts
  def initialize
    @category_counts = Hash.new(0)
  end
  def start_element(name,att=nil)
    @count = name=='cat'
  end
  def characters(str)
    if @count
      @category_counts[str] += 1
      @count = false
    end
  end
end

def create_xml
  time("Create XML file") do
    File.open('tmp.xml','w') do |f|
      f << "<root>
      <summarysection><totalcount>10000</totalcount></summarysection>
      <items>
      #{
        ITEM_COUNT.times.map{ |i|
          "<item>
          <cat>#{CATEGORIES.sample}</cat>
          <name>Name #{i}</name>
          <value>Value #{i}</value>
          </item>"
        }.join("\n")
      }
      </items>
      </root>"
    end
  end
end

test! if __FILE__ == $0
If we strip away some of the test structure, the DOM-based counter looks like this:
# Open the file on disk and pass it to Nokogiri so that it can stream read;
# Better than doc = Nokogiri.XML(IO.read('tmp.xml'))
# which requires us to load a huge string into memory just to parse it
doc = File.open('tmp.xml','r'){ |f| Nokogiri.XML(f) }
# Create a hash with default '0' values for any 'missing' keys
counts = Hash.new(0)
# Find every `<cat>` element in the document (assumes one per <item>)
doc.xpath('//cat').each do |cat|
  # Get the child text node's content and use it as the key to the hash
  counts[cat.children[0].content] += 1
end
First, let's focus on this code:
class CategoryCounter < Nokogiri::XML::SAX::Document
  attr_reader :category_counts
  def initialize
    @category_counts = Hash.new(0)
  end
  def start_element(name,att=nil)
    @count = name=='cat'
  end
  def characters(str)
    if @count
      @category_counts[str] += 1
      @count = false
    end
  end
end
When we create a new instance of this class we get an object that has a Hash that defaults to 0 for all values, and a couple of methods that can be called on it. The SAX Parser will call these methods as it runs through the document.
Each time the SAX parser sees a new element it calls the start_element method on this class. When that happens, we set a flag based on whether this element is named "cat" (so that we can grab its text in the next callback).
Each time the SAX parser slurps up a chunk of text it calls the characters method of our object. When that happens, we check whether the last element we saw was a category (i.e. whether @count was set to true); if so, we use the value of this text node as the category name and add one to our counter.
To use our custom object with Nokogiri's SAX parser we do this:
# Create a new instance, with its empty hash
counter = CategoryCounter.new
# Create a new parser that will call methods on our object, and then
# use `parse_file` so that it streams data from disk instead of flooding RAM
Nokogiri::HTML::SAX::Parser.new(counter).parse_file('tmp.xml')
# Once that's done, we can get the hash of category counts back from our object
counts = counter.category_counts
p counts["Pigs"]
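Finally, if installing Nokogiri is a problem in your environment, Ruby's standard library ships a stream parser in REXML that follows the same listener pattern. A stdlib-only sketch (no gems; the inline XML stands in for the real file):

```ruby
require 'rexml/parsers/streamparser'
require 'rexml/streamlistener'

# Stdlib REXML stream-parsing equivalent of the Nokogiri SAX counter.
class RexmlCategoryCounter
  include REXML::StreamListener
  attr_reader :category_counts
  def initialize
    @category_counts = Hash.new(0)
    @in_cat = false
  end
  def tag_start(name, _attrs)
    @in_cat = (name == 'cat')
  end
  def text(str)
    if @in_cat
      @category_counts[str] += 1
      @in_cat = false
    end
  end
end

xml = "<items><item><cat>Cats</cat><name>A</name></item>" \
      "<item><cat>Dogs</cat><name>B</name></item>" \
      "<item><cat>Cats</cat><name>C</name></item></items>"

listener = RexmlCategoryCounter.new
REXML::Parsers::StreamParser.new(xml, listener).parse
puts listener.category_counts.inspect  # e.g. {"Cats"=>2, "Dogs"=>1}
```

REXML is much slower than Nokogiri, but for a one-off script with no native-extension build step it can be good enough.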