
Unicode: How to obtain all code points for a character like e.g. ã (so it can be used in JavaScript regex)?

My Unicode-related vocabulary isn't very good, so sorry for the verbose question.

A character like ã can be represented by \u00e3 (Latin small letter a with tilde), or \u0061 (Latin small letter a) in combination with combining diacritical mark \u0303 (combining tilde). Now, in Java, in order to match any Unicode letter, I'd look for [\p{L}], but JavaScript doesn't understand that, so I'll have to look for the individual code points (\unnnn). How can I start with an ã and figure out all the various ways it can be represented in Unicode so I can include them in my regular expression in \unnnn format?

Christian asked Dec 04 '25 18:12



1 Answer

How can I start with an ã and figure out all the various ways it can be represented in Unicode

You're looking for Unicode equivalence.

The two forms you mentioned are the composed form and the decomposed form. To get canonically equivalent Unicode forms, you can use String.prototype.normalize().

  • Important: check browser compatibility for normalize() before relying on it.
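As a minimal sketch of the two forms from the question, the snippet below converts the composed ã to its decomposed form and back. The variable names are only illustrative, and the character is written as an escape so the file's own encoding cannot affect the result.

```javascript
// U+00E3 (Latin small letter a with tilde), written as an escape on purpose.
const composed = "\u00E3";

// NFD splits it into the base letter plus the combining tilde.
const decomposed = composed.normalize("NFD");

// NFC puts the two code points back together.
const recomposed = decomposed.normalize("NFC");

console.log(composed.length);                // 1
console.log(decomposed.length);              // 2
console.log(decomposed === "\u0061\u0303");  // true
console.log(recomposed === composed);        // true
```

Both strings usually render identically, which is why a regex written against only one of them can silently miss the other.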

str.normalize([form]) accepts the following forms:

  • NFC — Normalization Form Canonical Composition.
  • NFD — Normalization Form Canonical Decomposition.
  • NFKC — Normalization Form Compatibility Composition.
  • NFKD — Normalization Form Compatibility Decomposition.

Code point sequences that are defined as canonically equivalent are assumed to have the same appearance and meaning when printed or displayed.

Sequences that are defined as compatible are assumed to have possibly distinct appearances, but the same meaning in some contexts.

Quote from Wikipedia
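To make the canonical versus compatibility distinction concrete, here is a small sketch using the ligature ﬁ (U+FB01), a character I picked for illustration because it has a compatibility mapping but no canonical one. For ã the compatibility forms give the same result as the canonical ones.

```javascript
// U+FB01 (Latin small ligature fi): compatible with, but not canonically
// equivalent to, the two letters "f" + "i".
const ligature = "\uFB01";

const results = {};
for (const form of ["NFC", "NFD", "NFKC", "NFKD"]) {
  results[form] = ligature.normalize(form);
}

console.log(results.NFC === "\uFB01");  // true: canonical forms keep the ligature
console.log(results.NFD === "\uFB01");  // true
console.log(results.NFKC === "fi");     // true: compatibility forms expand it
console.log(results.NFKD === "fi");     // true

// For U+00E3 there is no compatibility mapping, so NFKD matches NFD.
const aTilde = "\u00E3";
console.log(aTilde.normalize("NFKD") === aTilde.normalize("NFD")); // true
```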

Choose the equivalence form that fits your use case.


For example, using the Latin small letter a with tilde in Compatibility Form:

var char = "ã";
var nfkc = char.normalize('NFKC');
var nfkd = char.normalize('NFKD');

// Returns each UTF-16 code unit of str as a \uXXXX escape sequence
function escapeUnicode(str){
    var i;
    var result = "";
    for( i = 0; i < str.length; ++i){
        var c = str.charCodeAt(i);
        c = c.toString(16).toUpperCase();
        while (c.length < 4) {
            c = "0" + c;
        }
        result += "\\u" + c;
    }
    return result;
}


document.write('<br />NFKC: ' + escapeUnicode(nfkc));
document.write('<br />NFKD: ' + escapeUnicode(nfkd));
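To tie this back to the regex in the question, here is one possible sketch that turns the canonical forms of a character into a \unnnn alternation. The helpers toEscapes and equivalentFormsPattern are hypothetical names of my own, not built-in APIs, and the sketch only covers the NFC and NFD spellings; it does not enumerate every possible ordering of multiple combining marks.

```javascript
// Hypothetical helper: write each UTF-16 code unit of str as \uXXXX.
function toEscapes(str) {
  let out = "";
  for (let i = 0; i < str.length; i++) {
    out += "\\u" + str.charCodeAt(i).toString(16).toUpperCase().padStart(4, "0");
  }
  return out;
}

// Hypothetical helper: build a non-capturing group that matches the
// composed (NFC) and decomposed (NFD) spellings of ch.
function equivalentFormsPattern(ch) {
  // A Set drops the duplicate when NFC and NFD happen to be identical.
  const variants = new Set([ch.normalize("NFC"), ch.normalize("NFD")]);
  // Longer sequences first, so a multi-code-point form is tried before a shorter one.
  const sorted = [...variants].sort((a, b) => b.length - a.length);
  return "(?:" + sorted.map(toEscapes).join("|") + ")";
}

const pattern = equivalentFormsPattern("\u00E3");
const re = new RegExp(pattern);

console.log(pattern);                // (?:\u0061\u0303|\u00E3)
console.log(re.test("\u00E3"));      // true  (composed)
console.log(re.test("a\u0303"));     // true  (decomposed)
console.log(re.test("a"));           // false (bare letter)
```

An alternative that avoids building alternations at all is to normalize the input text to a single form (for example with text.normalize("NFC")) before matching. Also, engines that implement ES2018 Unicode property escapes accept /\p{L}/u directly, so the workaround may only be needed for older environments.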
Mariano answered Dec 06 '25 09:12



