
What's the point of String.normalize()?

While reviewing JavaScript concepts, I came across String.normalize(). It does not appear in W3Schools' "JavaScript String Reference", which is probably why I missed it before.

I found more information about it on HackerRank, which states:

Returns a string containing the Unicode Normalization Form of the calling string's value.

With the example:

var s = "HackerRank";
console.log(s.normalize());
console.log(s.normalize("NFKC"));

having as output:

HackerRank
HackerRank

Also, in GeeksForGeeks:

The string.normalize() is an inbuilt function in javascript which is used to return a Unicode normalisation form of a given input string.

with the example:

<script> 
  
  // Taking a string as input. 
  var a = "GeeksForGeeks"; 
    
  // calling normalize function. 
  b = a.normalize('NFC') 
  c = a.normalize('NFD') 
  d = a.normalize('NFKC') 
  e = a.normalize('NFKD') 
    
  // Printing normalised form. 
  document.write(b +"<br>"); 
  document.write(c +"<br>"); 
  document.write(d +"<br>"); 
  document.write(e); 
    
</script> 

having as output:

GeeksForGeeks
GeeksForGeeks
GeeksForGeeks
GeeksForGeeks

Maybe the examples given are just really bad as they don't allow me to see any change.

I wonder... what's the point of this method?

Tiago Martins Peres asked Jan 25 '23 21:01


2 Answers

It depends on what you will do with the strings: often you do not need it (if you are just taking input from the user and showing it back to the user). But if you check, search, or use such strings as keys, you may want a unique way to identify the same string (semantically speaking).

The main problem is that you may have two strings that are semantically the same but have two different representations: e.g. one with an accented character [one code point], and one with a base character followed by a combining accent [one code point for the character, one for the combining accent]. The user may not be in control of how the input text is sent, so you may end up with two different user names, or two different passwords, for what looks like the same text. Also, if you process the data, you may get different results depending on the initial string. Users do not like that.
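As a rough sketch of the user-name case: the `users` map and the `register` helper below are made up for illustration (they are not from any library), and NFC is just one reasonable choice of form.

```javascript
// Hypothetical registry keyed by the NFC-normalized user name.
const users = new Map();

// Hypothetical helper: returns false if an equivalent name is already taken.
function register(name) {
  const key = name.normalize('NFC');
  if (users.has(key)) return false;
  users.set(key, { displayName: name }); // keep the text as the user typed it
  return true;
}

const composed = 'Am\u00e9lie';    // é as one code point
const decomposed = 'Ame\u0301lie'; // e followed by a combining acute accent

const firstResult = register(composed);
const secondResult = register(decomposed);
console.log(firstResult, secondResult); // true false
```

Without the normalize() call, both registrations would succeed and you would have two accounts that look identical on screen.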

Another problem is the order of combining characters. A character may have an accent and a lower tail (e.g. a cedilla), and you can express this with several combinations: "pure char, tail, accent", "pure char, accent, tail", "char+tail, accent", "char+accent, tail".
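The four combinations can be tried out directly. This is only a small sketch, using "c" with a cedilla and an acute accent as the example character (my choice, any similar pair of marks would do):

```javascript
// Four spellings of "c with cedilla and acute accent".
const forms = [
  'c\u0327\u0301', // pure char, tail (cedilla), accent (acute)
  'c\u0301\u0327', // pure char, accent, tail
  '\u00e7\u0301',  // char+tail (ç), accent
  '\u0107\u0327',  // char+accent (ć), tail
];

// As raw strings they are all different.
const rawDistinct = new Set(forms).size;

// After normalization they collapse to a single string.
const nfcForms = forms.map((s) => s.normalize('NFC'));
const nfcDistinct = new Set(nfcForms).size;

console.log(rawDistinct, nfcDistinct); // 4 1
```

NFD gives a single result as well: the fully decomposed sequence with the combining marks in one fixed (canonical) order.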

And you may have degenerate cases (especially if you type from a keyboard): you may get code points which should be removed (you may have an arbitrarily long string which could be equivalent to a few bytes).

In any case, to sort strings, you (or your library) need a normalized form: if you already provide the right form, the library will not need to transform it again.

So: you want the same (semantically speaking) string to always have the same sequence of Unicode code points.

Note: If you are working directly on UTF-8 bytes, you should also care about special cases of UTF-8: the same code point could be written in different ways [using more bytes]. Such overlong encodings are invalid UTF-8, and accepting them can be a security problem.
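This byte-level issue is separate from normalize(), which works on code points and not on bytes. A small sketch of how it could be checked, assuming a runtime that provides the standard TextDecoder (browsers, Node.js); `isValidUtf8` is a hypothetical helper name:

```javascript
// Hypothetical helper: true if the bytes are well-formed UTF-8.
function isValidUtf8(bytes) {
  try {
    // With fatal: true, decode() throws a TypeError on malformed input
    // instead of silently substituting U+FFFD.
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return true;
  } catch (e) {
    return false;
  }
}

const slash = new Uint8Array([0x2f]);               // "/" in its normal 1-byte form
const overlongSlash = new Uint8Array([0xc0, 0xaf]); // same code point written with 2 bytes

console.log(isValidUtf8(slash));         // true
console.log(isValidUtf8(overlongSlash)); // false
```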

The K forms (NFKC and NFKD) are often used for searches and similar tasks: CO2 and CO₂ will be interpreted in the same manner. But this can change the meaning of the text, so it should usually be used only internally, for temporary tasks, while keeping the original text.
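For instance, a search could compare NFKC forms internally and still return the untouched originals. A minimal sketch, where `searchItems` and the sample data are invented for the example:

```javascript
// Hypothetical search helper: matches on NFKC forms, returns the original strings.
function searchItems(items, query) {
  const key = query.normalize('NFKC').toLowerCase();
  return items.filter((item) =>
    item.normalize('NFKC').toLowerCase().includes(key)
  );
}

const docs = ['CO\u2082 emissions', 'H\u2082O supply', 'CO2 sensor'];
const hits = searchItems(docs, 'co2');

console.log(hits.length);                     // 2
console.log(hits[0] === 'CO\u2082 emissions'); // true: original text is kept
```

Note that NFC alone would not help here: the subscript two only has a compatibility mapping to "2", so 'CO\u2082'.normalize('NFC') leaves it unchanged.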

Giacomo Catenazzi answered Feb 02 '23 10:02


As stated in the MDN documentation, String.prototype.normalize() returns the Unicode Normalization Form of the string. This is needed because, in Unicode, some characters can be represented by more than one sequence of code points.

This is the example (taken from MDN):

const name1 = '\u0041\u006d\u00e9\u006c\u0069\u0065';
const name2 = '\u0041\u006d\u0065\u0301\u006c\u0069\u0065';

console.log(`${name1}, ${name2}`);
// expected output: "Amélie, Amélie"
console.log(name1 === name2);
// expected output: false
console.log(name1.length === name2.length);
// expected output: false

const name1NFC = name1.normalize('NFC');
const name2NFC = name2.normalize('NFC');

console.log(`${name1NFC}, ${name2NFC}`);
// expected output: "Amélie, Amélie"
console.log(name1NFC === name2NFC);
// expected output: true
console.log(name1NFC.length === name2NFC.length);
// expected output: true

As you can see, the string Amélie has two different Unicode representations. With normalization, we can reduce the two forms to the same string.

Greedo answered Feb 02 '23 09:02