This question tries to collect information spread over questions about different languages and YAML implementations in a mostly language-agnostic manner.
Suppose I have a YAML file like this:
first:
- foo: {a: "b"}
- "bar": [1, 2, 3]
second: | # some comment
some long block scalar value
I want to load this file into an native data structure, possibly change or add some values, and dump it again. However, when I dump it, the original formatting is not preserved:
"b"
loses its quotation marks, the value of second
is not a literal block scalar anymore, etc.foo
is written in block style instead of the given flow style, similarly the sequence value of "bar"
is written in block stylefirst
/second
) changesfirst
are not indented anymore.How can I preserve the formatting of the original file?
You can open a YAML file in any text editor, such as Microsoft Notepad (Windows) or Apple TextEdit (Mac). However, if you intend to edit a YAML file, you should open it using a source code editor, such as NotePad++ (Windows) or GitHub Atom (cross-platform).
Dumping YAML dump accepts the second optional argument, which must be an open text or binary file. In this case, yaml. dump will write the produced YAML document into the file. Otherwise, yaml. dump returns the produced document.
Developer can use remote URL to save and edit the YAML files. Here are few samples of the YAML URL. Now click on these YAML samples. Once this link is open, use save as functionality of the browser and save.
We can read the YAML file using the PyYAML module's yaml. load() function. This function parse and converts a YAML object to a Python dictionary ( dict object). This process is known as Deserializing YAML into a Python.
Preface: Throughout this answer, I mention some popular YAML implementations. Those mentions are never exhaustive since I do not know all YAML implementations out there.
I will use YAML terms for data structures: Atomic text content (even numbers) is a scalar. Item sequences, known elsewhere as arrays or lists, are sequences. A collection of key-value pairs, known elsewhere as dictionary or hash, is a mapping.
If you are using Python, using ruamel will help you preserve quite some formatting since it implements round-tripping up to native structures. However, it isn't perfect and cannot preserve all formatting.
The process of loading YAML is also a process of losing information. Let's have a look at the process of loading/dumping YAML, as given in the spec:
When you are loading a YAML file, you are executing some or all of the steps in the Load direction, starting at the Presentation (Character Stream). YAML implementations usually promote their most high-level APIs, which load the YAML file all the way to Native (Data Structure). This is true for most common YAML implementations, e.g. PyYAML/ruamel, SnakeYAML, go-yaml, and Ruby's YAML module. Other implementations, such as libyaml and yaml-cpp, only provide deserialization up to the Representation (Node Graph), possibly due to restrictions of their implementation languages (loading into native data structures requires either compile-time or runtime reflection on types).
The important information for us is what is contained in those boxes. Each box mentions information which is not available anymore in the box left to it. So this means that styles and comments, according to the YAML specification, are only present in the actual YAML file content, but are discarded as soon as the YAML file is parsed. For you, this means that once you have loaded a YAML file to a native data structure, all information about how it originally looked in the input file is gone. Which means that when you dump the data, the YAML implementation chooses a representation it deems useful for your data. Some implementations let you give general hints/options, e.g. that all scalars should be quoted, but that doesn't help you restore the original formatting.
Thankfully, this diagram only describes the logical process of loading YAML; a conforming YAML implementation does not need to slavishly conform to it. Most implementations actually preserve data longer than they need to. This is true for PyYAML/ruamel, SnakeYAML, go-yaml, yaml-cpp, libyaml and others. In all these implementations, the style of scalars, sequences and mappings is remembered up until the Representation (Node Graph) level.
On the other hand, comments are discarded rather early since they do not belong to an event or node (the exceptions here is ruamel which links comments to the following event, and go-yaml which remembers comments before, at and after the line that created a node). Some YAML implementations (libyaml, SnakeYAML) provide access to a token stream which is even more low-level than the Event Tree. This token stream does contain comments, however it is only usable for doing things like syntax highlighting, since the APIs do not contain methods for consuming the token stream again.
If you need to only load your YAML file and then dump it again, use one of the lower-level APIs of your implementation to only load the YAML up until the Representation (Node Graph) or Serialization (Event Tree) level. The API functions to search for are compose/parse and serialize/present respectively.
It is preferable to use the Event Tree instead of the Node Graph as some implementations already forget the original order of mapping keys (due to internally using hashmaps) when composing. This question, for example, details loading / dumping events with SnakeYAML.
Information that is already lost in the event stream of your implementation, for example comments in most implementations, is impossible to preserve. Also impossible to preserve is scalar layout, like in this example:
"1 \x2B 1"
This loads as string "1 + 1"
after resolving the escape sequence. Even in the event stream, the information about the escape sequence has already been lost in all implementations I know. The event only remembers that it was a double-quoted scalar, so writing it back will result in:
"1 + 1"
Similarly, a folded block scalar (starting with >
) will usually not remember where line breaks in the original input have been folded into space characters.
To sum up, loading to the Event Tree and dumping again will usually preserve:
You will usually lose:
If you use the Node Graph instead of the Event Tree, you will likely lose anchor representations (i.e. that &foo
may be written out as &a
later with all aliases referring to it using *a
instead of *foo
). You might also lose key order in mappings. Some APIs, like go-yaml, don't provide access to the Event Tree, so you have no choice but to use the Node Graph instead.
If you want to modify data and still preserve what you can of the original formatting, you need to manipulate your data without loading it to a native structure. This usually means that you operate on YAML scalars, sequences and mappings, instead of strings, numbers, lists or whatever structures the target programming language provides.
You have the option to either process the Event Tree or the Node Graph (assuming your API gives you access to it). Which one is better usually depends on what you want to do:
In any case, you need to know a bit about YAML type resolution to work with the given data correctly. When you load a YAML file into a declared native structure (typical in languages with a static type system, e.g. Java or Go), the YAML processor will map the YAML structure to the target type if that's possible. However, if no target type is given (typical in scripting languages like Python or Ruby, but also possible in Java), types are deduced from node content and style.
Since we are not working with native loading because we need to preserve formatting information, this type resolution will not be executed. However, you need to know how it works in two cases:
42
and need to know whether that is a string or integer.42
, you might want to control whether that it is loaded as integer 42
or string "42"
later.I won't discuss all the details here; in most cases, it suffices to know that if a string is encoded as a scalar but looks like something else (e.g. a number), you should use a quoted scalar.
Depending on your implementation, you may come in touch with YAML tags. Seldom used in YAML files (they look like e.g. !!str
, !!map
, !!int
and so on), they contain type information about a node which can be used in collections with heterogeneous data. More importantly, YAML defines that all nodes without an explicit tag will be assigned one as part of type resolution. This may or may not have already happened at the Node Graph level. So in your node data, you may see a node's tag even when the original node does not have one.
Tags starting with two exclamation marks are actually shorthands, e.g. !!str
is a shorthand for tag:yaml.org,2002:str
. You may see either in your data, since implementations handle them quite differently.
Important for you is that when you create a node or event, you may be able and may also need to assign a tag. If you don't want the output to contain an explicit tag, use the non-specific tags !
for non-plain scalars and ?
for everything else on event level. On node level, consult your implementation's documentation about whether you need to supply resolved tags. If not, same rule for the non-specific tags applies. If the documentation does not mention it (few do), try it out.
So to sum up: You modify data by loading either the Event Tree or the Node Graph, you add, delete or modify events or nodes in the data you get, and then you present the modified data as YAML again. Depending on what you want to do, it may help you to create the data you want to add to your YAML file as native structure, serialize it to YAML and then load it again as Node Graph or Event Tree. From there, you can include it in the structure of the YAML file you want to modify.
YAML has not been designed for this task. In fact, it has been defined as a serialization language, assuming that your data is authored as native data structures in some programming language and from there dumped to YAML. However, in reality, YAML is used a lot for configuration, meaning that you typically write YAML by hand and then load it into native data structures.
This contrast is the reason why it is so difficult to modify YAML files while preserving formatting: The YAML format has been designed as transient data format, to be written by one application, and then to be loaded by another (or the same) application. In that process, preserving formatting does not matter. It does, however, for data that is checked-in to version control (you want your diff to only contain the line(s) with data you actually changed), and other situations where you write your YAML by hand, because you want to keep style consistent.
There is no perfect solution for changing exactly one data item in a given YAML file and leaving everything else intact. Loading a YAML file does not give you a view of the YAML file, it gives you the content it describes. Therefore, everything that is not part of the described content – most importantly, comments and whitespace – is extremely hard to preserve.
If format preservation is important to you and you can't live with the compromises made by the suggestions in this answer, YAML is not the right tool for you.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With