My Unicode-related vocabulary isn't very good, so sorry for the verbose question.
A character like ã can be represented by \u00e3 (Latin small letter a with tilde), or \u0061 (Latin small letter a) in combination with combining diacritical mark \u0303 (combining tilde). Now, in Java, in order to match any Unicode letter, I'd look for [\p{L}], but JavaScript doesn't understand that, so I'll have to look for the individual code points (\unnnn). How can I start with an ã and figure out all the various ways it can be represented in Unicode so I can include them in my regular expression in \unnnn format?
How can I start with an ã and figure out all the various ways it can be represented in Unicode
You're looking for the Unicode Equivalence.
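The two forms you mentioned are the composed form and the decomposed form. To get canonically equivalent Unicode forms, you could use String.prototype.normalize().
str.normalize([form]) accepts the following forms: NFC (canonical composition, the default), NFD (canonical decomposition), NFKC (compatibility composition), and NFKD (compatibility decomposition).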
Code point sequences that are defined as canonically equivalent are assumed to have the same appearance and meaning when printed or displayed.
Sequences that are defined as compatible are assumed to have possibly distinct appearances, but the same meaning in some contexts.
Quote from Wikipedia
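To make the difference between the two kinds of equivalence concrete, here is a small sketch. The variable names are only for illustration, and the ligature U+FB01 is my own choice of example, not something from the question:

```javascript
// Canonical equivalence: precomposed ã (U+00E3) vs. a (U+0061) + combining tilde (U+0303).
var composed = "\u00E3";
var decomposed = "a\u0303";

// The raw strings differ, but they agree once both are normalized to the same form.
var rawMatch = composed === decomposed;                                          // false
var canonicalMatch = composed.normalize("NFC") === decomposed.normalize("NFC");  // true

// Compatibility equivalence: the ligature ﬁ (U+FB01) vs. the two letters "fi".
var ligature = "\uFB01";
var canonicalOnly = ligature.normalize("NFC") === "fi";   // false, NFC keeps the ligature
var compatibility = ligature.normalize("NFKC") === "fi";  // true, NFKC expands it
```

For ã itself the canonical and compatibility forms come out the same, so the choice only matters for characters such as ligatures, full-width letters, or superscripts.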
Choose the equivalence form you like.
For example, using the Latin small letter a with tilde in Compatibility Form:
var char = "ã";
var nfkc = char.normalize('NFKC');
var nfkd = char.normalize('NFKD');
// Returns the UTF-16 code units of str as \uXXXX escape sequences
function escapeUnicode(str) {
    var i;
    var result = "";
    for (i = 0; i < str.length; ++i) {
        var c = str.charCodeAt(i);
        c = c.toString(16).toUpperCase();
        // Left-pad with zeros to four hex digits
        while (c.length < 4) {
            c = "0" + c;
        }
        result += "\\u" + c;
    }
    return result;
}
document.write('<br />NFKC: ' + escapeUnicode(nfkc));
document.write('<br />NFKD: ' + escapeUnicode(nfkd));
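To get from there to the regular expression you asked about, one possible approach is to collect the distinct normalized spellings and join them as alternatives. This is a rough sketch, not a complete solution: canonicalVariants and toRegexEscapes are hypothetical helper names I made up, and only the NFC and NFD spellings are covered. A character carrying several combining marks can have other canonically equivalent orderings that this does not generate.

```javascript
// Hypothetical helper: the distinct canonical spellings (NFC and NFD) of a string.
function canonicalVariants(str) {
  var forms = [str.normalize("NFC"), str.normalize("NFD")];
  return forms.filter(function (s, i) {
    return forms.indexOf(s) === i; // drop the duplicate when NFC and NFD are identical
  });
}

// Hypothetical helper: writes each UTF-16 code unit as a \uXXXX escape for a RegExp source.
function toRegexEscapes(str) {
  var out = "";
  for (var i = 0; i < str.length; ++i) {
    out += "\\u" + ("0000" + str.charCodeAt(i).toString(16)).slice(-4);
  }
  return out;
}

// Build one alternation that accepts either spelling of ã.
var alternatives = canonicalVariants("\u00E3").map(toRegexEscapes);
var pattern = new RegExp("(?:" + alternatives.join("|") + ")");

var matchesComposed = pattern.test("\u00E3");    // precomposed ã
var matchesDecomposed = pattern.test("a\u0303"); // a + combining tilde
var matchesPlainA = pattern.test("a");           // bare a, no tilde
```

If you control the input, it is often simpler to call normalize("NFC") on the text before matching, so the pattern only needs the composed form.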