How to search an XML when parsing it using SAX in nokogiri

Tags: ruby, nokogiri, sax

I have a simple but huge XML file like the one below. I want to parse it with SAX and print only the text inside the title tag.

<root>
    <site>some site</site>
    <title>good title</title>
</root>

I have the following code:

require 'rubygems'
require 'nokogiri'
include Nokogiri

class PostCallbacks < XML::SAX::Document
  def start_element(element, attributes)
    if element == 'title'
      puts "found title"
    end
  end

  def characters(text)
    puts text
  end
end

parser = XML::SAX::Parser.new(PostCallbacks.new)
parser.parse_file("myfile.xml")

The problem is that it prints the text inside every tag. How can I print only the text inside the title tag?

ralph asked Jan 21 '23 20:01


2 Answers

You just need to keep track of when you're inside a <title> so that characters knows when it should pay attention. Something like this (untested code) perhaps:

class PostCallbacks < XML::SAX::Document
  def initialize
    super
    @in_title = false
  end

  def start_element(element, attributes = [])
    if element == 'title'
      puts "found title"
      @in_title = true
    end
  end

  def end_element(element)
    # Only leave "title mode" when the <title> itself closes, so elements
    # nested inside a <title> don't reset the flag early.
    @in_title = false if element == 'title'
  end

  def characters(text)
    # May be called more than once for a single text node.
    puts text if @in_title
  end
end
mu is too short answered Jan 23 '23 11:01


The accepted answer above is correct, but it has a drawback: it reads through the whole XML file even if it finds <title> right at the beginning.

I had similar needs and ended up writing saxy, a Ruby gem designed to be efficient in such situations. Under the hood it implements Nokogiri's SAX API.

Here's how you'd use it:

require 'saxy'
title = Saxy.parse(path_to_your_file, 'title').first

It stops as soon as it finds the first occurrence of the <title> tag.

Michał Szajbe answered Jan 23 '23 09:01