Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm for YAML minification that includes reference insertion

Is there an existing tool, library or algorithm that implements YAML minification with reference insertion? I don't want to reinvent the wheel if there is already a common solution for this problem.

I made up the term reference insertion in this context, so it's possible there is a different name for the behavior I want. Let me explain what I mean by reference insertion:

  • A basic YAML minification algorithm without reference insertion, MinifyYamlBasic , would be isomorphic to JSON minification, but would replace any JSON primitive with the least verbose equivalent YAML node syntax. I found an online example of MinifyYamlBasic implementation at https://onlineyamltools.com/minify-yaml .

  • A YAML minification algorithm with reference insertion, MinifyYamlWithReferences would do everything that MinifyYamlBasic does, but would then generate anchors for any YAML nodes that occur multiple times and replace duplicate YAML nodes with references to the generated anchor, in descending order of YAML node text length.

This is the basic idea for the behavior I'm seeking, although there would obviously be additional factors to consider. Duplicate nodes that are smaller than their aliases probably shouldn't be replaced by references. Input documents would need to be normalized so that semantically equivalent nodes can be identified by syntactic comparison, which would include, at least: canonical form substitution of scalars; canonical ordering of mapping node keys; dereferencing and removal of preexisting anchors.

There's probably more that I haven't considered yet. Before I take the time to consider it further,

  1. Are there any available resources that provide an existing implementation of MinifyYamlWithReferences (or something notably similar) ?

  2. Is there a commonly used name for the algorithm (or category of algorithms) that I have referred to as "minification with reference insertion"?

like image 720
smartcaveman Avatar asked Oct 16 '22 17:10

smartcaveman


1 Answers

What you want to do is a transformation. The resulting YAML of such an operation would semantically not be equivalent to the original YAML. Take, for example, this small YAML:

[foo, foo]

This YAML represents the following document graph:

    +----------------------+
    | Sequence (root node) |
    +--|----------------|--+
       |                |
    +--v--+          +--v--+
    | foo |          | foo |
    +-----+          +-----+

Using your MinifyYamlWithReferences, this would become something like:

[&a foo, *a]

Which represents the following document graph:

    +----------------------+
    | Sequence (root node) |
    +--|----------------|--+
       |                |
    +--v--+             |
    | foo |<------------+
    +-----+

So you created a semantically different YAML document! Calling this minification would be misleading, as minification is generally thought of as the process of transforming a source to a smaller, but semantically equivalent, source. What you are after is deduplication (along with minification), which is a semantic transformation.

I am not aware of any existing implementations for that. Implementing it would also be tough as you need to answer difficult questions like:

  1. Which of !!str a, "a", 'a' and a are equivalent?
  2. Are these two equivalent: &a [ *a ], [*a]?

(Explanation: In 1., we have two quoted scalars that both evaluate to a scalar node with ! tag. ! on scalars gets resolved to !!str using the YAML core schema. a is a scalar that does not match a regex for anything more specific than a string, so it would also get !!str according to the core schema. But are you sure the core schema is used? In 2., we have a sequence with a direct cycle, and a sequence that just links to that cycle, but a (theoretical) comparison by infinite recursion would say that the two are equal. So are they?)

I fail to see the use-case for such a transformation. If your goal is to minimize data sent over the network, I would suggest to use a compression algorithm on the initial YAML instead – that is by far simpler (already exists, internally also does deduplication on the character stream, but reversible).

like image 177
flyx Avatar answered Oct 20 '22 23:10

flyx