
Nokogiri and finding element by name

I am parsing an XML file using Nokogiri with the following snippet:

doc.xpath('//root').each do |root|
  puts "# ROOT found"
  root.xpath('//page').each do |page|
    puts "## PAGE found / #{page['id']} / #{page['name']} / #{page['width']} / #{page['height']}"
    page.children.each do |content|
      ...
    end
  end
end

How can I parse through all elements in the page element? There are three different elements: image, text and video. How can I make a case statement for each element?

trnc asked Dec 06 '22 21:12


2 Answers

Honestly, you look pretty close to me. Switch on each child's name:

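doc.xpath('//root').each do |root|
  puts "# ROOT found"
  # './/page' stays relative to this root; '//page' would search the whole document
  root.xpath('.//page').each do |page|
    puts "## PAGE found / #{page['id']} / #{page['name']} / #{page['width']} / #{page['height']}"
    # element_children skips text nodes; a whitespace text node also reports
    # its name as 'text' and would otherwise hit the 'text' branch
    page.element_children.each do |child|
      case child.name
      when 'image'
        do_image_stuff
      when 'text'
        do_text_stuff
      when 'video'
        do_video_stuff
      end
    end
  end
end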
noli answered Dec 11 '22 10:12


Both Nokogiri's CSS and XPath accessors allow multiple tags to be specified, which can be useful for this sort of problem. Rather than walking through every tag inside the document's page tag, you can search for only the tags you care about. Given this sample document:

require 'nokogiri'

doc = Nokogiri::XML('
  <xml>
  <body>
  <image>image</image>
  <text>text</text>
  <video>video</video>
  <other>other</other>
  <image>image</image>
  <text>text</text>
  <video>video</video>
  <other>other</other>
  </body>
  </xml>')

This is a search using CSS:

doc.search('image, text, video').each do |node|
  case node.name
  when 'image'
    puts node.text
  when 'text'
    puts node.text
  when 'video'
    puts node.text
  else
    puts 'should never get here'
  end
end

# >> image
# >> image
# >> text
# >> text
# >> video
# >> video

Notice that the output above groups the tags in the order the CSS accessor lists them. That grouping may depend on the Nokogiri version, so don't rely on it. If you need the tags in the order they appear in the document, use an XPath union:

doc.search('//image | //text | //video').each do |node|
  puts node.text
end

# >> image
# >> text
# >> video
# >> image
# >> text
# >> video

In either case, the program should run faster because all the searching occurs in libxml2, the C library underneath Nokogiri, and only the nodes you need are handed back to Ruby for processing.

If you need to restrict the search to within a <page> tag, first find the page node, then search underneath it:

doc.at('page').search('image, text, video').each do |node|
  ...
end

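or, with XPath, using a leading dot so each path stays relative to the page node (a bare '//image' would search the whole document again):

doc.at('//page').search('.//image | .//text | .//video').each do |node|
  ...
end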
the Tin Man answered Dec 11 '22 12:12