I have a simple but huge XML file like the one below. I want to parse it using SAX and print only the text inside the <title> tag.
<root>
<site>some site</site>
<title>good title</title>
</root>
I have the following code:
require 'rubygems'
require 'nokogiri'
include Nokogiri

class PostCallbacks < XML::SAX::Document
  def start_element(element, attributes)
    if element == 'title'
      puts "found title"
    end
  end

  def characters(text)
    puts text
  end
end

parser = XML::SAX::Parser.new(PostCallbacks.new)
parser.parse_file("myfile.xml")
The problem is that it prints the text between all the tags. How can I print only the text inside the <title> tag?
You just need to keep track of when you're inside a <title> so that characters knows when it should pay attention. Something like this (untested code) perhaps:
class PostCallbacks < Nokogiri::XML::SAX::Document
  def initialize
    @in_title = false
  end

  def start_element(element, attributes = [])
    if element == 'title'
      puts "found title"
      @in_title = true
    end
  end

  def end_element(element)
    # Only reset on </title>, so an element nested inside <title>
    # does not switch the flag off too early.
    @in_title = false if element == 'title'
  end

  def characters(text)
    # Print text only while we are inside <title>.
    puts text if @in_title
  end
end
The accepted answer above is correct, but it has a drawback: it goes through the whole XML file even if it finds <title> right at the beginning.
I had similar needs and ended up writing the saxy Ruby gem, which aims to be efficient in such situations. Under the hood it is built on Nokogiri's SAX API.
Here's how you'd use it:
require 'saxy'
title = Saxy.parse(path_to_your_file, 'title').first
It stops as soon as it finds the first occurrence of a <title> tag.