 

Ruby YAML parser bypassing constructor

Tags: ruby, yaml

I am working on an application that takes input from a YAML file, parses it into objects, and lets them do their thing. The only problem I'm having now is that the YAML parser seems to ignore the object's "initialize" method. I was counting on the constructor to fill in any instance variables the YAML file was lacking with defaults, as well as store some things in class variables. Here is an example:

class Test

    @@counter = 0

    def initialize(a,b)
        @a = a
        @b = b

        @a = 29 if @b == 3

        @@counter += 1
    end

    def self.how_many
        p @@counter
    end

    attr_accessor :a,:b

end

require 'yaml'  # lowercase: 'YAML' fails to load on case-sensitive file systems

a = Test.new(2,3)
s = a.to_yaml
puts s
b = YAML::load(s)
puts b.a
puts b.b
Test.how_many

puts ""

c = Test.new(4,4)
c.b = 3
t = c.to_yaml
puts t
d = YAML::load(t)
puts d.a
puts d.b
Test.how_many

I would have expected the above to output:

--- !ruby/object:Test
a: 29
b: 3
29
3
2

--- !ruby/object:Test
a: 4
b: 3
29
3
4

Instead I got:

--- !ruby/object:Test
a: 29
b: 3
29
3
1

--- !ruby/object:Test
a: 4
b: 3
4
3
2

I don't understand how it makes these objects without using their defined initialize method. I'm also wondering if there is any way to force the parser to use the initialize method.

grasingerm asked Oct 10 '12 14:10


2 Answers

Deserializing an object from Yaml doesn’t use the initialize method because in general there is no correspondence between the object’s instance variables (which is what the default Yaml serialization stores) and the parameters to initialize.

As an example, consider an object with an initialize that looks like this (with no other instance variables):

def initialize(param_one, param_two)
  @a_variable = some_calculation(param_one, param_two)
end

Now when an instance of this is deserialized, the Yaml processor has a value for @a_variable, but the initialize method requires two parameters, so it can’t call it. Even if the number of instance variables matches the number of parameters to initialize, it is not necessarily the case that they correspond, and even if they did, the processor doesn’t know the order in which they should be passed to initialize.

The default process for serializing and deserializing a Ruby object to Yaml is to write out all instance variables (with their names) during serialization, then when deserializing allocate a new instance of the class and simply set the same instance variables on this new instance.
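
The default procedure can be imitated by hand, which shows why initialize never runs. This is only a minimal sketch, not Psych's actual code; the Point class and the values hash are made up for illustration.

```ruby
# Hypothetical class, used only for illustration.
class Point
  attr_reader :x, :y, :initialized

  def initialize(x, y)
    @x = x
    @y = y
    @initialized = true # only set when the constructor runs
  end
end

# Values as they might have been read from the Yaml mapping.
values = { 'x' => 1, 'y' => 2 }

# Roughly what the loader does: allocate without calling initialize,
# then copy each stored value into an instance variable of the same name.
point = Point.allocate
values.each { |name, value| point.instance_variable_set("@#{name}", value) }

p point.x           # 1
p point.y           # 2
p point.initialized # nil, because initialize was never called
```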

Of course sometimes you need more control of this process. If you are using the Psych Yaml processor (which is the default in Ruby 1.9.3) then you should implement the encode_with (for serialization) or init_with (for deserialization) methods as appropriate.

For serialization, Psych will call the encode_with method of an object if it is present, passing a coder object. This object allows you to specify how the object should be represented in Yaml – normally you just treat it like a hash.

For deserialization, Psych will call the init_with method if it is present on your object instead of using the default procedure described above, again passing a coder object. This time the coder will contain the information about the object’s representation in Yaml.

Note that you don’t need to provide both methods; you can provide just one of them if you want. If you do provide both, the coder object you get passed in init_with will essentially be the same as the one passed to encode_with after that method has run.

As an example, consider an object that has some instance variables that are calculated from others (perhaps as an optimisation to avoid a large calculation), but shouldn’t be serialized to the Yaml.

class Foo

  def initialize(first, second)
    @first = first
    @second = second
    @calculated = expensive_calculation(@first, @second)
  end

  def encode_with(coder)
    # @calculated shouldn’t be serialized, so we just add the other two.
    # We could provide different names to use in the Yaml here if we
    # wanted (as long as the same names are used in init_with).
    coder['first'] = @first
    coder['second'] = @second
  end

  def init_with(coder)
    # The Yaml only contains values for @first and @second, we need to
    # recalculate @calculated so the object is valid.
    @first = coder['first']
    @second = coder['second']
    @calculated = expensive_calculation(@first, @second)
  end

  # The expensive calculation
  def expensive_calculation(a, b)
    ...
  end
end

When you dump an instance of this class to Yaml, it will look something like this, without the calculated value:

--- !ruby/object:Foo
first: 1
second: 2

When you load this Yaml back into Ruby, the created object will have the @calculated instance variable set.
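
A full round trip might look like the sketch below. Rectangle is a hypothetical stand-in for Foo, with a trivial multiplication in place of the expensive calculation. The unsafe_load branch is an assumption about newer Psych releases, where YAML.load no longer builds arbitrary objects by default; it should only be used on trusted input.

```ruby
require 'yaml'

# Hypothetical class following the same pattern as Foo above.
class Rectangle
  attr_reader :width, :height, :area

  def initialize(width, height)
    @width = width
    @height = height
    @area = width * height
  end

  # Serialize only the two inputs, not the derived @area.
  def encode_with(coder)
    coder['width'] = @width
    coder['height'] = @height
  end

  # Restore the inputs and recompute the derived value.
  def init_with(coder)
    @width = coder['width']
    @height = coder['height']
    @area = @width * @height
  end
end

yaml = YAML.dump(Rectangle.new(2, 3))
puts yaml # the dump has width and height but no area entry

# Use unsafe_load where it exists, otherwise fall back to load.
copy = YAML.respond_to?(:unsafe_load) ? YAML.unsafe_load(yaml) : YAML.load(yaml)
p copy.area # 6, recomputed in init_with
```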

If you wanted you could call initialize from within init_with, but I think it would be better to keep a clear separation between initializing a new instance of the class and deserializing an existing instance from Yaml. I would recommend extracting the common logic into methods that can be called from both instead.
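
Applied to the class from the question, extracting the shared logic could look roughly like this. Widget and finish_setup are hypothetical names chosen for the sketch, and the unsafe_load branch is an assumption about newer Psych releases (trusted input only).

```ruby
require 'yaml'

# Hypothetical class modelled on Test from the question.
class Widget
  @@counter = 0

  attr_accessor :a, :b

  def initialize(a, b)
    @a = a
    @b = b
    finish_setup
  end

  # Called by Psych instead of the default instance-variable copy.
  def init_with(coder)
    @a = coder['a']
    @b = coder['b']
    finish_setup
  end

  def self.how_many
    @@counter
  end

  private

  # Logic shared by construction and deserialization.
  def finish_setup
    @a = 29 if @b == 3
    @@counter += 1
  end
end

w = Widget.new(4, 4)
w.b = 3 # a is still 4 here, as in the question
yaml = YAML.dump(w)
copy = YAML.respond_to?(:unsafe_load) ? YAML.unsafe_load(yaml) : YAML.load(yaml)

p copy.a          # 29, because finish_setup ran during loading
p Widget.how_many # 2, one for new and one for the load
```

Keeping initialize and init_with as thin wrappers around a shared method means neither path has to fake the other's arguments.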

matt answered Oct 09 '22 07:10


If you only want this behavior with pure Ruby classes that use @-style instance variables (not those from compiled extensions and not Struct-style), the following should work. YAML seems to call the allocate class method when loading an instance of that class, even if the instance is nested as a member of another object. So we can redefine allocate. Example:

class Foo
  attr_accessor :yaml_flag
  def self.allocate
    super.tap {|o| o.instance_variables.include?(:@yaml_flag) or o.yaml_flag = true }
  end
end
class Bar
  attr_accessor :foo, :yaml_flag
  def self.allocate
    super.tap {|o| o.instance_variables.include?(:@yaml_flag) or o.yaml_flag = true }
  end
end

>> bar = Bar.new
=> #<Bar:0x007fa40ccda9f8>
>> bar.foo = Foo.new
=> #<Foo:0x007fa40ccdf9f8>
>> [bar.yaml_flag, bar.foo.yaml_flag]
=> [nil, nil]
>> bar_reloaded = YAML.load YAML.dump bar
=> #<Bar:0x007fa40cc7dd48 @foo=#<Foo:0x007fa40cc7db90 @yaml_flag=true>, @yaml_flag=true>
>> [bar_reloaded.yaml_flag, bar_reloaded.foo.yaml_flag]
=> [true, true]

# won't overwrite false
>> bar.foo.yaml_flag = false
=> false
>> bar_reloaded = YAML.load YAML.dump bar
=> #<Bar:0x007fa40ccf3098 @foo=#<Foo:0x007fa40ccf2f08 @yaml_flag=false>, @yaml_flag=true>
>> [bar_reloaded.yaml_flag, bar_reloaded.foo.yaml_flag]
=> [true, false]

# won't overwrite nil
>> bar.foo.yaml_flag = nil
=> nil
>> bar_reloaded = YAML.load YAML.dump bar
=> #<Bar:0x007fa40cd73518 @foo=#<Foo:0x007fa40cd73360 @yaml_flag=nil>, @yaml_flag=true>
>> [bar_reloaded.yaml_flag, bar_reloaded.foo.yaml_flag]
=> [true, nil]

I intentionally avoided an o.nil? check in the tap blocks because nil may actually be a meaningful value that you don't want to overwrite.

One last caveat: allocate may be used by third-party libraries (or by your own code), and you may not want to set the members in those cases. If you want to restrict this to YAML loading only, you'll have to do something more fragile and complex, like checking the caller stack in the allocate method to see if YAML is calling it.

I'm on ruby 1.9.3 (with psych) and the top of the stack looks like this (path prefix removed):

psych/visitors/to_ruby.rb:274:in `revive'
psych/visitors/to_ruby.rb:219:in `visit_Psych_Nodes_Mapping'
psych/visitors/visitor.rb:15:in `visit'
psych/visitors/visitor.rb:5:in `accept'
psych/visitors/to_ruby.rb:20:in `accept'
psych/visitors/to_ruby.rb:231:in `visit_Psych_Nodes_Document'
psych/visitors/visitor.rb:15:in `visit'
psych/visitors/visitor.rb:5:in `accept'
psych/visitors/to_ruby.rb:20:in `accept'
psych/nodes/node.rb:35:in `to_ruby'
psych.rb:128:in `load'
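
A caller check along those lines might be sketched as follows. Tracked is a hypothetical class, and the path fragment 'psych/visitors/to_ruby' is an assumption taken from the trace above; it depends on Psych internals and may not hold for every version. The unsafe_load branch is likewise an assumption about newer Psych releases.

```ruby
require 'yaml'

# Hypothetical class for illustration.
class Tracked
  attr_accessor :name, :yaml_flag

  def self.allocate
    obj = super
    # Assumption: frames from Psych's loader contain this path fragment.
    from_yaml = caller.any? { |frame| frame.include?('psych/visitors/to_ruby') }
    obj.yaml_flag = true if from_yaml
    obj
  end
end

original = Tracked.new
original.name = 'x'
yaml = YAML.dump(original)
loaded = YAML.respond_to?(:unsafe_load) ? YAML.unsafe_load(yaml) : YAML.load(yaml)

direct = Tracked.allocate # called from user code, not from Psych

p original.yaml_flag # not set: new does not go through this allocate
p direct.yaml_flag   # not set: no Psych frame in the caller stack
p loaded.yaml_flag   # set only if the path assumption holds
```

Matching on file paths is brittle, so this is best treated as a last resort next to init_with.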
Kelvin answered Oct 09 '22 05:10