If I accept full Unicode for passwords, how should I normalize the string before passing it to the hash function?
Without normalization, if someone sets their password to "mañana" (ma\u00F1ana
) on one computer and tries to log in with "mañana" (ma\u006E\u0303ana
) on another computer, the hashes will be different and the login will fail. This is under the control of the user-agent or its operating system.
Unicode normalization forms: http://unicode.org/reports/tr15/#Norm_Forms
"office" == "office"
.Unicode Normalization Forms are formally defined normalizations of Unicode strings which make it possible to determine whether any two Unicode strings are equivalent to each other.
Normalization Form Canonical Composition. Characters are decomposed and then recomposed by canonical equivalence. NFKD. Normalization Form Compatibility Decomposition. Characters are decomposed by compatibility, and multiple combining characters are arranged in a specific order.
Normalization is undefined in case of malformed inputs, such as alleged UTF-8 text that contains illegal byte sequences. Illegal bytes may be interpreted differently in different environments: Rejection, replacement, or omission.
Recommendation #1: If possible, reject inputs that do not conform to the expected encoding. (This may be out of the application's control, however.)
The Unicode Annex 15 guarantees normalization stability when the input contains assigned characters only:
11.1 Stability of Normalized Forms
For all versions, even prior to Unicode 4.1, the following policy is followed:
A normalized string is guaranteed to be stable; that is, once normalized, a string is normalized according to all future versions of Unicode.
More precisely, if a string has been normalized according to a particular version of Unicode and contains only characters allocated in that version, it will qualify as normalized according to any future version of Unicode.
Recommendation #2: Whichever normalization form is used must use the Normalization Process for Stabilized Strings, i.e., reject any password inputs that contain unassigned characters, since their normalization is not guaranteed stable under server upgrades.
The compatibility normalization forms seem to handle Japanese better, collapsing several decompositions into the same output where the canonical forms do not.
The spec warns:
Normalization Forms KC and KD must not be blindly applied to arbitrary text. Because they erase many formatting distinctions, they will prevent round-trip conversion to and from many legacy character sets, and unless supplanted by formatting markup, they may remove distinctions that are important to the semantics of the text.
However, semantics and round-tripping are not of concern here.
Recommendation #3: Apply NFKC or NFKD before hashing.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With