
Is there a well-defined difference between "normalizing" and "canonicalizing" data?

I understand canonicalization and normalization to mean removing any non-meaningful or ambiguous parts of the data's presentation, turning effectively identical data into actually identical data.

For example, if you want to get the hash of some input data and it's important that anyone else hashing the canonically same data gets the same hash, you don't want one file indenting with tabs and the other using spaces (and no other difference) to cause two very different hashes.

In the case of JSON:

  • object properties would be placed in a standard order (perhaps alphabetically)
  • unnecessary whitespace would be stripped
  • indentation would be either standardized or stripped
  • the data may even be re-modeled in an entirely new syntax, to enforce the above
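
The first three bullets can be sketched roughly like this in Python, using only the standard library. The function names `canonical_json` and `canonical_hash` are made up for this illustration, and the sketch ignores subtler issues such as number formatting and duplicate keys, so treat it as a starting point rather than a complete canonicalization scheme:

```python
import hashlib
import json


def canonical_json(text: str) -> str:
    """Re-serialize JSON text with sorted keys and no optional whitespace."""
    value = json.loads(text)
    # sort_keys fixes the property order; the compact separators drop
    # every space and line break that json.dumps would otherwise emit.
    return json.dumps(value, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)


def canonical_hash(text: str) -> str:
    """SHA-256 hex digest of the canonical serialization (hypothetical helper)."""
    return hashlib.sha256(canonical_json(text).encode("utf-8")).hexdigest()


# Same data, different key order and indentation (tab vs. spaces).
a = '{"name": "x",\n\t"tags": [1, 2]}'
b = '{ "tags":[1,2],  "name":"x" }'

print(canonical_json(a))                        # {"name":"x","tags":[1,2]}
print(canonical_hash(a) == canonical_hash(b))   # True
```

Hashing the raw strings `a` and `b` directly would give two unrelated digests; hashing the re-serialized form gives one.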

Is my definition correct, and are the terms interchangeable? Or is there a well-defined and specific difference between canonicalization and normalization of input data?

Jacob Ford asked Oct 16 '22 05:10


2 Answers

"Canonicalize" & "normalize" (from "canonical (form)" & "normal form") are two related general mathematical terms that also have particular uses in particular contexts per some exact meaning given there. It is reasonable to label a particular process by one of those terms when the general meaning applies.

Your characterizations of those specific uses are fuzzy. The formal meanings for general & particular cases are more useful.

Sometimes given a bunch of things we partition them (all) into (disjoint) groups, aka equivalence classes, of ones that we consider to be in some particular sense similar or the same, aka equivalent. The members of a group/class are the same/equivalent according to some particular equivalence relation.

We pick a particular member as the representative thing from each group/class & call it the canonical form for that group & its members. Two things are equivalent exactly when they are in the same equivalence class. Two things are equivalent exactly when their canonical forms are equal.
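
A toy illustration of the idea, with an equivalence relation chosen purely for the example (two words are "the same" when they are anagrams of each other, ignoring case). The names `canonical`, `equivalent` and `partition` are hypothetical helpers for this sketch, not anything standard:

```python
from collections import defaultdict


def canonical(word: str) -> str:
    """Representative of a word's class: its lowercased letters in sorted order."""
    return "".join(sorted(word.lower()))


def equivalent(a: str, b: str) -> bool:
    """Two words are equivalent exactly when their canonical forms are equal."""
    return canonical(a) == canonical(b)


def partition(words):
    """Group words into equivalence classes, keyed by canonical form."""
    classes = defaultdict(list)
    for w in words:
        classes[canonical(w)].append(w)
    return dict(classes)


print(equivalent("Listen", "silent"))   # True
print(partition(["listen", "silent", "enlist", "google"]))
# {'eilnst': ['listen', 'silent', 'enlist'], 'eggloo': ['google']}
```

Note that the representative `"eilnst"` is itself a member of the class it stands for, and that canonicalizing it again changes nothing.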

A normal form might be a canonical form or just one of several distinguished members.

To canonicalize/normalize is to find or use a canonical/normal form of a thing.

From Wikipedia's "Canonical form" article:

"The distinction between "canonical" and "normal" forms varies by subfield. In most fields, a canonical form specifies a unique representation for every object, while a normal form simply specifies its form, without the requirement of uniqueness."

Applying the definition to your example: do you have a bunch of values that you are partitioning, and are you picking some member(s) of each class to stand in for the other members of that class? Well, you have JSON values and, short of re-modeling them, you are partitioning them according to which same-class member they map to under a function. So you can reasonably call the resulting JSON values canonical forms of the inputs. If you characterize re-modeling as applicable to all inputs, then you can also reasonably call the re-modeled forms of those canonical values canonical forms of the re-modeled input values. If not, people probably won't complain when you call the re-modeled values canonical forms of the input values, even though technically they wouldn't be.

philipxy answered Oct 21 '22 01:10


Consider a set of objects, each of which can have multiple representations. In your example, that would be the set of JSON objects, where each object has multiple valid representations, e.g., with a different ordering of its members, less whitespace, etc.

Canonicalization is the process of converting any representation of a given object to one and only one representation, unique per object (a.k.a. its canonical form). To test whether two representations are of the same object, it suffices to test equality on their canonical forms; see also Wikipedia's definition.

Normalization is the process of converting any representation of a given object to a set of representations (a.k.a. "normal forms") that is unique per object. In that case, equality between two representations is tested by "subtracting" their normal forms and comparing the result with a normal form of "zero" (typically a trivial comparison). Normalization may be a better option when canonical forms are difficult to implement consistently, e.g., because they depend on arbitrary choices (like the ordering of variables).
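
To make the "subtract and compare with zero" idea concrete, here is a small sketch using integer fractions stored as unreduced `(numerator, denominator)` pairs. This representation and the helper names are my own assumptions for illustration. Any pair with a nonzero denominator counts as a normal form, so one value has many normal forms, but zero is always recognizable by its numerator:

```python
from math import gcd

# A fraction is a tuple (numerator, denominator) with denominator != 0.
# Unreduced pairs are "normal forms": many per value, e.g. (1, 2) and (2, 4).


def subtract(a, b):
    """Return a - b as an unreduced fraction."""
    (n1, d1), (n2, d2) = a, b
    return (n1 * d2 - n2 * d1, d1 * d2)


def is_zero(frac):
    """A fraction in this form is zero exactly when its numerator is 0."""
    return frac[0] == 0


def equal_by_normal_form(a, b):
    """Equality test that never needs a unique representation."""
    return is_zero(subtract(a, b))


def canonical(frac):
    """Lowest terms with a positive denominator: one representation per value."""
    n, d = frac
    g = gcd(n, d)          # always positive here because d != 0
    if d < 0:
        g = -g             # move the sign into the numerator
    return (n // g, d // g)


print((1, 2) == (2, 4))                        # False: different representations
print(equal_by_normal_form((1, 2), (2, 4)))    # True: difference is (0, 8)
print(canonical((1, 2)) == canonical((2, 4)))  # True: both are (1, 2)
```

The zero test needs no arbitrary conventions, whereas `canonical` has to settle on lowest terms and on where the sign goes.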

Section 1.2 of the book "A=B" has some really good examples of both concepts.

Panos answered Oct 21 '22 02:10