Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What Unicode normalization (and other processing) is appropriate for passwords when hashing?

If I accept full Unicode for passwords, how should I normalize the string before passing it to the hash function?

Goals

Without normalization, if someone sets their password to "mañana" (ma\u00F1ana) on one computer and tries to log in with "mañana" (ma\u006E\u0303ana) on another computer, the hashes will be different and the login will fail. This is under the control of the user-agent or its operating system.

  • I'd like to ensure that those hash to the same thing.
  • I am not concerned about homoglyphs such as Α, А, and A (Greek, Cyrillic, Latin).

Reference

Unicode normalization forms: http://unicode.org/reports/tr15/#Norm_Forms

Considerations

  • Any normalization procedure may cause collisions, e.g. "office" == "office".
  • Normalization can change the number of bytes in the string.

Further questions

  • What happens if the server receives a byte sequence that is not valid UTF-8 (or other format)? Reject, since it can't be normalized?
  • What happens if the server receives characters that are unassigned in its version of Unicode?
like image 459
treat your mods well Avatar asked Apr 23 '13 15:04

treat your mods well


People also ask

Why is Unicode Normalization important?

Unicode Normalization Forms are formally defined normalizations of Unicode strings which make it possible to determine whether any two Unicode strings are equivalent to each other.

What is NFKD normalization?

Normalization Form Canonical Composition. Characters are decomposed and then recomposed by canonical equivalence. NFKD. Normalization Form Compatibility Decomposition. Characters are decomposed by compatibility, and multiple combining characters are arranged in a specific order.


1 Answers

Normalization is undefined in case of malformed inputs, such as alleged UTF-8 text that contains illegal byte sequences. Illegal bytes may be interpreted differently in different environments: Rejection, replacement, or omission.

Recommendation #1: If possible, reject inputs that do not conform to the expected encoding. (This may be out of the application's control, however.)

The Unicode Annex 15 guarantees normalization stability when the input contains assigned characters only:

11.1 Stability of Normalized Forms

For all versions, even prior to Unicode 4.1, the following policy is followed:

A normalized string is guaranteed to be stable; that is, once normalized, a string is normalized according to all future versions of Unicode.

More precisely, if a string has been normalized according to a particular version of Unicode and contains only characters allocated in that version, it will qualify as normalized according to any future version of Unicode.

Recommendation #2: Whichever normalization form is used must use the Normalization Process for Stabilized Strings, i.e., reject any password inputs that contain unassigned characters, since their normalization is not guaranteed stable under server upgrades.

The compatibility normalization forms seem to handle Japanese better, collapsing several decompositions into the same output where the canonical forms do not.

The spec warns:

Normalization Forms KC and KD must not be blindly applied to arbitrary text. Because they erase many formatting distinctions, they will prevent round-trip conversion to and from many legacy character sets, and unless supplanted by formatting markup, they may remove distinctions that are important to the semantics of the text.

However, semantics and round-tripping are not of concern here.

Recommendation #3: Apply NFKC or NFKD before hashing.

like image 92
treat your mods well Avatar answered Oct 17 '22 07:10

treat your mods well