I had a large YAML file with a massive use of YAML anchors and references, for example:
warehouse:
  obj1: &obj1
    key1: 1
    key2: 2
specific:
  spec1:
    <<: *obj1
  spec2:
    <<: *obj1
    key1: 10
The file got too large, so I looked for a solution that would allow me to split it into two files, warehouse.yaml and specific.yaml, and to include warehouse.yaml inside specific.yaml. I read this simple article, which describes how I can use PyYAML to achieve that, but it also says that the merge key (<<) is not supported.
I did indeed get an error:
yaml.composer.ComposerError: found undefined alias 'obj1'
when I tried it that way.
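For reference, the behaviour can be reproduced without any files. This is only a minimal sketch, assuming PyYAML is installed and importable as yaml; the names COMBINED, SPECIFIC_ONLY and try_load are made up for the illustration. It loads the single-file layout, where the merge key works, and then the part that would end up in specific.yaml on its own.

```python
import yaml

# The single-file layout from the question: anchor and aliases together.
COMBINED = """\
warehouse:
  obj1: &obj1
    key1: 1
    key2: 2
specific:
  spec1:
    <<: *obj1
  spec2:
    <<: *obj1
    key1: 10
"""

# What is left for specific.yaml once the anchor lives in another file.
SPECIFIC_ONLY = """\
specific:
  spec1:
    <<: *obj1
"""


def try_load(text):
    """Return the loaded data, or the exception if loading fails."""
    try:
        return yaml.safe_load(text)
    except yaml.YAMLError as exc:
        return exc


combined = try_load(COMBINED)
split = try_load(SPECIFIC_ONLY)

print(combined["specific"]["spec2"])  # merge key applied, key1 overridden
print(type(split).__name__)           # the alias has no anchor in this document
```

So the merge key itself is handled fine; what fails is looking up an alias whose anchor was defined in a different document.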
So I started looking for an alternative way, and I got confused because I don't really know much about PyYAML.
Can I get the desired merge key support? Any other solutions for my problem?
Crucial for the handling of anchors and aliases in PyYAML is the dict anchors that is part of the Composer. It maps anchors to nodes so that aliases can be looked up. Its existence is limited by the existence of the Composer, which is a composite element of the Loader that you use.
That Loader instance only exists for the duration of the call to yaml.load(), so there is no trivial way to extract the anchors afterwards: first you would have to make the instance of the Loader() persist, and then make sure that the normal compose_document() method is not called (which, among other things, does self.anchors = {} to be clean for the next document in a single stream).
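To make the lifetime of anchors concrete, here is a small sketch, assuming PyYAML (imported as yaml) and its pure-Python SafeLoader. KeepAnchorsLoader, TEXT and anchors_after_load are names invented for this illustration. It keeps a loader instance alive by hand and compares what is left in anchors after loading, with the stock compose_document() and with one that skips the reset.

```python
import yaml

TEXT = "base: &base {key1: 1}\ncopy: *base\n"


class KeepAnchorsLoader(yaml.SafeLoader):
    """Hypothetical SafeLoader variant whose anchors survive composition."""

    def compose_document(self):
        self.get_event()                      # drop DOCUMENT-START
        node = self.compose_node(None, None)  # fills self.anchors
        self.get_event()                      # drop DOCUMENT-END
        return node                           # no `self.anchors = {}` here


def anchors_after_load(loader_class):
    """Load TEXT with a persistent loader and report the anchor names left."""
    loader = loader_class(TEXT)
    try:
        loader.get_single_data()
        return sorted(loader.anchors)
    finally:
        loader.dispose()


default_names = anchors_after_load(yaml.SafeLoader)
kept_names = anchors_after_load(KeepAnchorsLoader)
print(default_names, kept_names)
```

With the stock loader the dict is empty again once the document has been composed; only the variant that skips the reset still knows about base.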
To further complicate things, if you had warehouse.yaml:
warehouse:
  obj1: &obj1
    key1: 1
    key2: 2
and specific.yaml:
warehouse: !include warehouse.yaml
specific:
  spec1:
    <<: *obj1
  spec2:
    <<: *obj1
    key1: 10
you would never get this to work with your snippet, even if you could preserve, extract and pass on the anchor information, because the composer handling specific.yaml will encounter the undefined alias long before the tag !include gets used for construction (and for filling anchors).
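The ordering problem can be seen in isolation with a stand-in constructor. The following is a sketch under the assumption that PyYAML is used; DemoLoader, fake_include and include_calls are hypothetical names, and fake_include does not read any file, it only records that it was called.

```python
import yaml

include_calls = []  # records every time the !include constructor runs


class DemoLoader(yaml.SafeLoader):
    """Separate loader class so the global SafeLoader stays untouched."""


def fake_include(loader, node):
    # Stand-in for a real include: it only records that it was invoked.
    include_calls.append(node.value)
    return {}


DemoLoader.add_constructor("!include", fake_include)

SPECIFIC = """\
warehouse: !include warehouse.yaml
specific:
  spec1:
    <<: *obj1
"""

try:
    yaml.load(SPECIFIC, Loader=DemoLoader)
    error = None
except yaml.composer.ComposerError as exc:
    error = exc

calls_on_failure = list(include_calls)  # snapshot before the control run

# Control: without the alias, the constructor is reached as expected.
yaml.load("warehouse: !include warehouse.yaml\n", Loader=DemoLoader)

print(calls_on_failure, include_calls)
```

The whole document is composed into nodes before any constructor runs, so the alias lookup fails while the !include node is still just a tagged scalar.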
What you can do to circumvent this problem is to include specific.yaml:
specific:
  spec1:
    <<: *obj1
  spec2:
    <<: *obj1
    key1: 10
from warehouse.yaml:
warehouse:
  obj1: &obj1
    key1: 1
    key2: 2
specific: !include specific.yaml
or include both in a third file. Please note that the key specific is in both files.
With those two files run:
import sys
from ruamel import yaml


def my_compose_document(self):
    self.get_event()
    node = self.compose_node(None, None)
    self.get_event()
    # self.anchors = {}    # <<<< commented out
    return node

yaml.SafeLoader.compose_document = my_compose_document


# adapted from http://code.activestate.com/recipes/577613-yaml-include-support/
def yaml_include(loader, node):
    with open(node.value) as inputfile:
        return list(my_safe_load(inputfile, master=loader).values())[0]
#              leave out the [0] if your include file drops the key ^^^

yaml.add_constructor("!include", yaml_include, Loader=yaml.SafeLoader)


def my_safe_load(stream, Loader=yaml.SafeLoader, master=None):
    loader = Loader(stream)
    if master is not None:
        loader.anchors = master.anchors
    try:
        return loader.get_single_data()
    finally:
        loader.dispose()

with open('warehouse.yaml') as fp:
    data = my_safe_load(fp)
yaml.safe_dump(data, sys.stdout, default_flow_style=False)
which gives:
specific:
  spec1:
    key1: 1
    key2: 2
  spec2:
    key1: 10
    key2: 2
warehouse:
  obj1:
    key1: 1
    key2: 2
If your specific.yaml did not have the top-level key specific:
spec1:
  <<: *obj1
spec2:
  <<: *obj1
  key1: 10
then replace the last line of yaml_include() with:
        return my_safe_load(inputfile, master=loader)
The above was done with ruamel.yaml (disclaimer: I am the author of that package) and tested on Python 2.7 and 3.6. By changing the import it will work with PyYAML as well.
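As an illustration of the PyYAML route, here is a self-contained sketch of the same idea. It is an adaptation rather than the code above verbatim: it subclasses yaml.SafeLoader instead of patching it globally, assumes the specific.yaml variant without the top-level key, and writes both files to a temporary directory so it can be run as is. IncludeLoader, load_file, demo and the base_dir attribute are names introduced only for this example.

```python
import os
import tempfile

import yaml

WAREHOUSE = """\
warehouse:
  obj1: &obj1
    key1: 1
    key2: 2
specific: !include specific.yaml
"""

SPECIFIC = """\
spec1:
  <<: *obj1
spec2:
  <<: *obj1
  key1: 10
"""


class IncludeLoader(yaml.SafeLoader):
    """SafeLoader variant that keeps anchors and resolves !include."""

    base_dir = "."  # illustrative attribute: where relative includes are looked up

    def compose_document(self):
        self.get_event()                      # drop DOCUMENT-START
        node = self.compose_node(None, None)
        self.get_event()                      # drop DOCUMENT-END
        return node                           # anchors are not reset


def load_file(path, master=None):
    """Load one file, sharing the anchors of the including loader if any."""
    with open(path) as stream:
        loader = IncludeLoader(stream)
        loader.base_dir = os.path.dirname(os.path.abspath(path))
        if master is not None:
            loader.anchors = master.anchors
        try:
            return loader.get_single_data()
        finally:
            loader.dispose()


def yaml_include(loader, node):
    return load_file(os.path.join(loader.base_dir, node.value), master=loader)


IncludeLoader.add_constructor("!include", yaml_include)


def demo():
    """Write both files to a temporary directory and load warehouse.yaml."""
    with tempfile.TemporaryDirectory() as tmp:
        for name, text in (("warehouse.yaml", WAREHOUSE), ("specific.yaml", SPECIFIC)):
            with open(os.path.join(tmp, name), "w") as fp:
                fp.write(text)
        return load_file(os.path.join(tmp, "warehouse.yaml"))


data = demo()
print(yaml.safe_dump(data, default_flow_style=False))
```

Subclassing keeps the changed compose_document() away from every other yaml.safe_load() call in the same process, which the global patch does not.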
With the new ruamel.yaml API the above can be much simplified, because the loader handed to the yaml_include() constructor knows about the YAML instance, but of course you still need an adapted compose_document that doesn't destroy anchors. Assuming the specific.yaml without the top-level key specific, the following gives the same output as before.
import sys
from ruamel.std.pathlib import Path
from ruamel.yaml import YAML, version_info

yaml = YAML(typ='safe', pure=True)
yaml.default_flow_style = False


def my_compose_document(self):
    self.parser.get_event()
    node = self.compose_node(None, None)
    self.parser.get_event()
    # self.anchors = {}    # <<<< commented out
    return node

yaml.Composer.compose_document = my_compose_document


# adapted from http://code.activestate.com/recipes/577613-yaml-include-support/
def yaml_include(loader, node):
    y = loader.loader
    yaml = YAML(typ=y.typ, pure=y.pure)  # same values as including YAML
    yaml.composer.anchors = loader.composer.anchors
    return yaml.load(Path(node.value))

yaml.Constructor.add_constructor("!include", yaml_include)

data = yaml.load(Path('warehouse.yaml'))
yaml.dump(data, sys.stdout)
It seems that someone has now solved this problem as an extension of ruamel.yaml.
pip install ruamel.yaml.include
(source on GitHub)
To get the desired output above:
warehouse.yml
obj1: &obj1
  key1: 1
  key2: 2
specific.yml
specific:
  spec1:
    <<: *obj1
  spec2:
    <<: *obj1
    key1: 10
Your code would be:
from ccorp.ruamel.yaml.include import YAML

yaml = YAML(typ='safe', pure=True)
yaml.allow_duplicate_keys = True

with open('specific.yml', 'r') as ymlfile:
    data = yaml.load(ymlfile)  # use `return yaml.load(ymlfile)` inside a function
It also includes a handy !exclude function if you want to keep the warehouse key out of your output. If you only want the specific key, your specific.yml could begin with:
!exclude includes:
- !include warehouse.yml
In that case, your warehouse.yml could also include the top-level warehouse: key.