<p>If I accept full Unicode for passwords, how should I normalize the string before passing it to the hash function?</p>
<h3>Goals</h3>
<p>Without normalization, if someone sets their password to "mañana" (<code>ma\u00F1ana</code>) on one computer and tries to log in with "mañana" (<code>ma\u006E\u0303ana</code>) on another computer, the hashes will differ and the login will fail. Which of the two forms is sent is decided by the user agent or its operating system, not by the user.</p>
<ul> <li>I'd like to ensure that both forms hash to the same value.</li> <li>I am not concerned about homoglyphs such as Α, А, and A (Greek, Cyrillic, Latin).</li> </ul>
<h3>Reference</h3>
<p>Unicode normalization forms: http://unicode.org/reports/tr15/#Norm_Forms</p>
<h3>Considerations</h3>
<ul> <li>Any normalization procedure may cause collisions between passwords that were distinct as typed, e.g. under the compatibility forms <code>"oﬃce" == "office"</code>.</li> <li>Normalization can change the number of bytes in the string.</li> </ul>
<h3>Further questions</h3>
<ul> <li>What should happen if the server receives a byte sequence that is not valid UTF-8 (or whatever encoding is expected)? Reject it, since it can't be normalized?</li> <li>What should happen if the server receives characters that are unassigned in its version of Unicode?</li> </ul>
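<p>To make the problem concrete, here is a minimal Python sketch of the two "mañana" spellings and the ligature collision mentioned above. SHA-256 is used only as a stand-in to show that the byte strings differ; it is an assumption for illustration, not a suggestion for how to hash passwords, and the helper name <code>sha256_hex</code> is hypothetical.</p>

```python
import hashlib
import unicodedata

composed = "ma\u00F1ana"          # precomposed n-with-tilde (U+00F1)
decomposed = "ma\u006E\u0303ana"  # "n" followed by combining tilde (U+0303)


def sha256_hex(text: str) -> str:
    """Illustrative stand-in for 'the hash function'; not a password hash."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Without normalization the two spellings produce different digests.
raw_match = sha256_hex(composed) == sha256_hex(decomposed)

# After canonical normalization (NFC) both spellings hash identically.
nfc_match = sha256_hex(unicodedata.normalize("NFC", composed)) == sha256_hex(
    unicodedata.normalize("NFC", decomposed)
)

# The ligature U+FB03 survives NFC but is folded to "ffi" by NFKC.
ligature_nfc = unicodedata.normalize("NFC", "o\uFB03ce") == "office"
ligature_nfkc = unicodedata.normalize("NFKC", "o\uFB03ce") == "office"

print(raw_match, nfc_match, ligature_nfc, ligature_nfkc)
```

<p>Here <code>raw_match</code> is false and <code>nfc_match</code> is true, while the ligature only collides with "office" under the compatibility form.</p>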

<p>Normalization is undefined for malformed input, such as purported UTF-8 text that contains illegal byte sequences. Different environments may handle illegal bytes differently: by rejecting the input, replacing the bytes, or omitting them.</p>
<p><strong>Recommendation #1</strong>: If possible, reject inputs that do not conform to the expected encoding. (This may be out of the application's control, however.)</p>
<p>Unicode Standard Annex #15 guarantees normalization stability when the input contains only assigned characters:</p>
<blockquote> <p>11.1 Stability of Normalized Forms</p> <p>For all versions, even prior to Unicode 4.1, the following policy is followed:</p> <p>A normalized string is guaranteed to be stable; that is, once normalized, a string is normalized according to all future versions of Unicode.</p> <p>More precisely, if a string has been normalized according to a particular version of Unicode and contains only characters allocated in that version, it will qualify as normalized according to any future version of Unicode.</p> </blockquote>
<p><strong>Recommendation #2</strong>: Whichever normalization form is chosen, apply it via the Normalization Process for Stabilized Strings; that is, reject any password that contains unassigned characters, since their normalization is not guaranteed to be stable across server upgrades.</p>
<p>The compatibility normalization forms seem to handle Japanese better: they collapse several variant encodings of the same text (half-width and full-width forms, for example) into the same output, where the canonical forms keep them distinct.</p>
<p>The spec warns:</p>
<blockquote> <p>Normalization Forms KC and KD must not be blindly applied to arbitrary text. Because they erase many formatting distinctions, they will prevent round-trip conversion to and from many legacy character sets, and unless supplanted by formatting markup, they may remove distinctions that are important to the semantics of the text.</p> </blockquote>
<p>However, neither semantics nor round-tripping is a concern here, because the normalized password is only ever fed to the hash function.</p>
<p><strong>Recommendation #3</strong>: Apply NFKC or NFKD before hashing.</p>
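<p>The three recommendations could be combined roughly as follows. This is a sketch under stated assumptions, not a definitive implementation: it assumes the password arrives as raw bytes that are expected to be UTF-8, it uses Python's <code>unicodedata</code> module (whose tables reflect the Unicode version bundled with the interpreter, see <code>unicodedata.unidata_version</code>), and the names <code>prepare_password</code> and <code>PasswordEncodingError</code> are hypothetical. Treating general category <code>Cn</code> as "unassigned" also rejects noncharacters, which seems acceptable for this purpose.</p>

```python
import unicodedata


class PasswordEncodingError(ValueError):
    """Hypothetical error type for passwords that cannot be safely normalized."""


def prepare_password(raw: bytes) -> bytes:
    """Return the bytes to feed to the password hash, or raise on bad input."""
    # Recommendation #1: reject input that is not well-formed UTF-8.
    try:
        text = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise PasswordEncodingError("password is not valid UTF-8") from exc

    # Recommendation #2: reject code points that are unassigned (category Cn)
    # in this interpreter's Unicode version, since their normalization could
    # change after an upgrade.
    for ch in text:
        if unicodedata.category(ch) == "Cn":
            raise PasswordEncodingError(
                f"unassigned code point U+{ord(ch):04X} in password"
            )

    # Recommendation #3: apply a compatibility form (NFKC here) before hashing.
    return unicodedata.normalize("NFKC", text).encode("utf-8")


# Both spellings of the example password now yield the same bytes.
same = prepare_password("ma\u00F1ana".encode("utf-8")) == prepare_password(
    "ma\u006E\u0303ana".encode("utf-8")
)
print(same)
```

<p>The returned bytes are what would be passed to the password hashing function. Choosing NFKC over NFKD is arbitrary in this sketch; either satisfies Recommendation #3 as long as the same form is used at registration and at login.</p>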

What Unicode normalization (and other processing) is appropriate for passwords when hashing?

asked Apr 23 '13 15:04

treat your mods well

1 Answer

answered Oct 17 '22 07:10

treat your mods well

Tags: passwords, unicode, password-storage, unicode-normalization, homoglyph
