XML to Hash conversion: Nori drops the attributes of the deepest XML elements

Summary

I'm using Ruby (ruby 2.1.2p95 (2014-05-08) [x86_64-linux-gnu] on my machine, ruby 1.9.3p484 (2013-11-22 revision 43786) [x86_64-linux] in the production environment) and Nori to convert an XML document (first processed with Nokogiri for some validation) into a Ruby Hash. I later discovered that Nori is dropping the attributes of the deepest XML elements.

Issue Details and Reproducing

To do this, I'm using code similar to the following:

xml  = Nokogiri::XML(File.open('file.xml')) { |config| config.strict.noblanks }
hash = Nori.new.parse xml.to_s

The code generally works as intended, except for one case. Whenever Nori parses the XML text, it drops element attributes from the leaf elements (i.e. elements that have no child elements).

For example, the following document:

<?xml version="1.0"?>
<root>
  <objects>
    <object>
      <fields>
        <id>1</id>
        <name>The name</name>
        <description>A description</description>
      </fields>
    </object>
  </objects>
</root>

...is converted to the expected Hash (some output omitted for brevity):

irb(main):066:0> xml = Nokogiri::XML(txt) { |config| config.strict.noblanks }
irb(main):071:0> ap Nori.new.parse(xml.to_s), :indent => -2
{
  "root" => {
    "objects" => {
      "object" => {
        "fields" => {
          "id"          => "1",
          "name"        => "The name",
          "description" => "A description"
        }
      }
    }
  }
}

The problem shows up when element attributes are used on elements with no children. For example, the following document is not converted as expected:

<?xml version="1.0"?>
<root>
  <objects>
    <object id="1">
      <fields>
        <field name="Name">The name</field>
        <field name="Description">A description</field>
      </fields>
    </object>
  </objects>
</root>

The same Nori.new.parse(xml.to_s), as displayed by awesome_print, shows the attributes of the deepest <field> elements are absent:

irb(main):131:0> ap Nori.new.parse(xml.to_s), :indent => -2
{
  "root" => {
    "objects" => {
      "object" => {
        "fields" => {
          "field" => [
            [0] "The name",
            [1] "A description"
          ]
        },
        "@id"    => "1"
      }
    }
  }
}

The Hash only has their values as a list, which is not what I wanted. I expected the <field> elements to retain their attributes just like their parent elements (e.g. see @id="1" for <object>), not for their attributes to get chopped off.

Even if the document is modified to look as follows, it still doesn't work as expected:

<?xml version="1.0"?>
<root>
  <objects>
    <object id="1">
      <fields>
        <Name type="string">The name</Name>
        <Description type="string">A description</Description>
      </fields>
    </object>
  </objects>
</root>

It produces the following Hash:

{
  "root" => {
    "objects" => {
      "object" => {
        "fields" => {
          "Name"        => "The name",
          "Description" => "A description"
        },
        "@id"    => "1"
      }
    }
  }
}

This lacks the type="whatever" attributes for each field entry.

Searching eventually led me to Issue #59, where the last post (from Aug 2015) states that the poster can't "find the bug in Nori's code."

Conclusion

So, my question is: Are any of you aware of a way to work around the Nori issue (e.g. perhaps a setting) that would allow me to use my original schema (i.e. the one with attributes in elements with no children)? If so, can you share a code snippet that will handle this correctly?

I had to redesign my XML schema and change the code about three times to make it work, so if there's a way to get Nori to behave, and I'm simply not aware of it, I'd like to know what it is.

I'd like to avoid installing more libraries as much as possible just to get this working properly with the schema structure I originally wanted to use, but I'm open to the possibility if it's proven to work. (I'd have to re-factor the code once again...) Frameworks are definitely overkill for this, so please: do not suggest Ruby on Rails or similar full-stack solutions.

Please note that my current solution, based on a (reluctantly) redesigned schema, is working, but it's more complicated to generate and process than the original one, and I'd like to go back to the simpler/shallower schema.

asked Mar 01 '16 22:03 by code_dredd


1 Answer

Nori is not actually dropping the attributes; they are just not being printed.

If you run the following Ruby script:

require 'nori'

data = Nori.new(empty_tag_value: true).parse(<<XML)
<?xml version="1.0"?>
<root>
  <objects>
    <object>
      <fields>
        <field name="Name">The name</field>
        <field name="Description">A description</field>
      </fields>
    </object>
  </objects>
</root>
XML

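field_list = data['root']['objects']['object']['fields']['field']

# p uses inspect, which shows only the text of each element
p field_list

# The attributes are still available through the attributes accessor
puts "text: '#{field_list[0]}' data: #{field_list[0].attributes}"
puts "text: '#{field_list[1]}' data: #{field_list[1].attributes}"

You should get the output: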

["The name", "A description"]
text: 'The name' data: {"name"=>"Name"}
text: 'A description' data: {"name"=>"Description"}

This clearly shows that the attributes are there, but are not displayed by the inspect method (p(x) being the same as puts x.inspect).

You will notice that puts field_list.inspect outputs ["The name", "A description"], but field_list[0].attributes returns the attribute keys and values.
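The same behaviour can be reproduced without Nori. The sketch below uses a hypothetical stand-in class, AttributedString, which only mimics what is described above (a String subclass with an extra attributes accessor); it is an assumption for illustration, not Nori's own code. It shows that inspect reports only the string contents while the attributes stay attached to the object.

```ruby
# Hypothetical stand-in for Nori::StringWithAttributes (an assumption for
# illustration): a String subclass that carries an extra attributes hash.
class AttributedString < String
  attr_accessor :attributes
end

field = AttributedString.new("The name")
field.attributes = { "name" => "Name" }

# String#inspect knows nothing about the extra accessor, so only the text shows.
shown = field.inspect

# The attributes are still attached to the object.
hidden = field.attributes

puts shown
puts hidden["name"]
```

Any printer that relies on inspect (p, pp, irb) will therefore show only the text, which matches what the question observed.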

If you would like pp to display the attributes, you can override the inspect method in Nori::StringWithAttributes:

class Nori
  class StringWithAttributes < String
    def inspect
      [attributes, String.new(self)].inspect
    end
  end
end

Or, if you want to change the output, you can override the self.new method so that it returns a different data structure:

class Nori
  class MyText < Array
    def attributes=(data)
      self[1] = data
    end
    attr_accessor :text
    def initialize(text)
      self[0] = text
      self[1] = {}
    end
  end
  class StringWithAttributes < String
    def self.new(x)
      MyText.new(x)
    end
  end
end

You can then access the data as a tuple:

puts "text: '#{data['root']['objects']['object']['fields']['field'][0].first}' data: #{ data['root']['objects']['object']['fields']['field'][0].last}"

This would make it so you could have the data as JSON or YAML as the text items would look like arrays with 2 elements. pp also works.

{"root"=>
  {"objects"=>
    {"object"=>
      {"fields"=>
        {"field"=>
          [["The name", {"name"=>"Name"}],
           ["A description", {"name"=>"Description"}]]},
       "bob"=>[{"@id"=>"id1"}, {"@id"=>"id2"}],
       "bill"=>
        [{"p"=>["one", {}], "@id"=>"bid1"}, {"p"=>["two", {}], "@id"=>"bid2"}],
       "@id"=>"1"}}}}

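If monkey-patching Nori is not desirable, a similar result can probably be had by post-processing the parsed Hash. The following is a minimal sketch under stated assumptions: AttributedString is again a hypothetical stand-in for Nori::StringWithAttributes, expand_attributes is a helper name invented here, and the "text" and "@"-prefixed key names are an arbitrary choice. It walks the data recursively and replaces every attributed string with a plain Hash, so the result can be serialised as JSON or YAML.

```ruby
require 'json'

# Hypothetical stand-in for Nori::StringWithAttributes (assumption).
class AttributedString < String
  attr_accessor :attributes
end

# Recursively copy parsed data, turning attributed strings into plain hashes.
# The "text" key and the "@" prefix are arbitrary choices for this sketch.
def expand_attributes(node)
  case node
  when Hash
    node.each_with_object({}) { |(key, value), out| out[key] = expand_attributes(value) }
  when Array
    node.map { |item| expand_attributes(item) }
  when String
    attrs = node.respond_to?(:attributes) ? node.attributes : nil
    # Strings without attributes become ordinary String copies.
    return String.new(node) if attrs.nil? || attrs.empty?

    result = { "text" => String.new(node) }
    attrs.each { |name, value| result["@#{name}"] = value }
    result
  else
    node
  end
end

name = AttributedString.new("The name")
name.attributes = { "name" => "Name" }
plain = AttributedString.new("no attrs")
plain.attributes = {}

parsed = { "fields" => { "field" => [name, plain] } }
converted = expand_attributes(parsed)
puts JSON.generate(converted)
```

With real Nori output, the same helper would be called on the Hash returned by parse, assuming its leaf strings respond to attributes as shown earlier in this answer.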
If you use awesome_print, the following extension should do what you want:

require 'awesome_print'
require 'nori'

# Copyright (c) 2016 G. Allen Morris III
#
# Awesome Print is freely distributable under the terms of MIT license.
# See LICENSE file or http://www.opensource.org/licenses/mit-license.php
#------------------------------------------------------------------------------
module AwesomePrint
  module Nori

    def self.included(base)
      base.send :alias_method, :cast_without_nori, :cast
      base.send :alias_method, :cast, :cast_with_nori
    end

    # Route Nori::StringWithAttributes objects to a dedicated formatter method.
    #-------------------------------------------------------------------
    def cast_with_nori(object, type)
      cast = cast_without_nori(object, type)
      if defined?(::Nori::StringWithAttributes) && object.is_a?(::Nori::StringWithAttributes)
        cast = :nori_xml_node
      end
      cast
    end

    #-------------------------------------------------------------------
    def awesome_nori_xml_node(object)
      return %Q|["#{object}", #{object.attributes}]|
    end
  end
end

AwesomePrint::Formatter.send(:include, AwesomePrint::Nori)

data = Nori.new(empty_tag_value: true).parse(<<XML)
<?xml version="1.0"?>
<root>
  <objects>
    <object>
      <fields>
        <field name="Name">The name</field>
        <field name="Description">A description</field>
      </fields>
    </object>
  </objects>
</root>
XML

ap data

The output is:

{
    "root" => {
        "objects" => {
            "object" => {
                "fields" => {
                    "field" => [
                        [0] ["The name", {"name"=>"Name"}],
                        [1] ["A description", {"name"=>"Description"}]
                    ]
                }
            }
        }
    }
}

answered Oct 11 '22 13:10 by G. Allen Morris III