Unicode: How to obtain all code points for a character like ã (so it can be used in a JavaScript regex)?

Question

My Unicode-related vocabulary isn't very good, so sorry for the verbose question.

A character like ã can be represented by \u00e3 (Latin small letter a with tilde), or \u0061 (Latin small letter a) in combination with combining diacritical mark \u0303 (combining tilde). Now, in Java, in order to match any Unicode letter, I'd look for [\p{L}], but JavaScript doesn't understand that, so I'll have to look for the individual code points (\unnnn). How can I start with an ã and figure out all the various ways it can be represented in Unicode so I can include them in my regular expression in \unnnn format?

Mariano · Accepted Answer

How can I start with an ã and figure out all the various ways it can be represented in Unicode

You're looking for Unicode equivalence.

The two forms you mentioned are the composed form and the decomposed form. To get canonically equivalent Unicode forms, you can use String.prototype.normalize().

Important: Check browser compatibility for String.prototype.normalize() before relying on it.

str.normalize([form]) accepts the following forms:

NFC — Normalization Form Canonical Composition.
NFD — Normalization Form Canonical Decomposition.
NFKC — Normalization Form Compatibility Composition.
NFKD — Normalization Form Compatibility Decomposition.

Code point sequences that are defined as canonically equivalent are assumed to have the same appearance and meaning when printed or displayed.

Sequences that are defined as compatible are assumed to have possibly distinct appearances, but the same meaning in some contexts.

Quote from Wikipedia
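To make the difference between canonical and compatibility equivalence concrete, here is a minimal sketch. The ligature ﬁ (U+FB01) is my own illustrative choice and does not come from the question. It is included because ã itself has no compatibility-only mapping, so all four forms give only two distinct results for it.

```javascript
// Two canonically equivalent spellings of ã.
const precomposed = "\u00E3";   // single code point
const decomposed = "a\u0303";   // a + combining tilde

// Canonical normalization makes both spellings converge.
const nfcSame = precomposed.normalize("NFC") === decomposed.normalize("NFC");
const nfdSame = precomposed.normalize("NFD") === decomposed.normalize("NFD");
console.log(nfcSame, nfdSame); // true true

// Illustrative assumption: the ligature U+FB01 is only *compatibility*
// equivalent to "fi", so canonical forms leave it alone.
const ligature = "\uFB01";
const ligatureNfc = ligature.normalize("NFC");   // still U+FB01
const ligatureNfkc = ligature.normalize("NFKC"); // "fi"
console.log(ligatureNfc === ligature, ligatureNfkc); // true fi
```

If the goal is only to match the different spellings of the same accented letter, the canonical forms (NFC and NFD) are probably enough. The compatibility forms additionally fold characters such as ligatures, which may or may not be what you want.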

Choose the equivalence form that suits your needs.

For example, using the Latin small letter a with tilde in the compatibility forms (NFKC and NFKD):

var char = "ã";
var nfkc = char.normalize('NFKC');
var nfkd = char.normalize('NFKD');

// Returns the UTF-16 code units of str as Unicode escape sequences (\uXXXX)
function escapeUnicode(str){
    var i;
    var result = "";
    for( i = 0; i < str.length; ++i){
        var c = str.charCodeAt(i);
        c = c.toString(16).toUpperCase();
        // Left-pad with zeros to four hex digits
        while (c.length < 4) {
            c = "0" + c;
        }
        // The backslash must be escaped: a bare "\u" is a syntax error
        result += "\\u" + c;
    }
    return result;
}

document.write('<br />NFKC: ' + escapeUnicode(nfkc)); // NFKC: \u00E3
document.write('<br />NFKD: ' + escapeUnicode(nfkd)); // NFKD: \u0061\u0303
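To tie this back to the regex in the question, the following is a rough sketch of how the escaped forms could be combined into one pattern. The helper names toUnicodeEscapes and canonicalFormsPattern are hypothetical and not part of any library. The sketch assumes the canonical forms (NFC and NFD) are the ones you want to match.

```javascript
// Hypothetical helper: escape every UTF-16 code unit of str as \uXXXX.
function toUnicodeEscapes(str) {
  let out = "";
  for (let i = 0; i < str.length; i++) {
    const hex = str.charCodeAt(i).toString(16).toUpperCase();
    out += "\\u" + hex.padStart(4, "0");
  }
  return out;
}

// Hypothetical helper: build a regex source that matches either the
// composed (NFC) or the decomposed (NFD) spelling of ch.
function canonicalFormsPattern(ch) {
  // A Set drops the duplicate when both forms are identical (e.g. plain "a").
  const forms = new Set([ch.normalize("NFC"), ch.normalize("NFD")]);
  return "(?:" + Array.from(forms, toUnicodeEscapes).join("|") + ")";
}

const source = canonicalFormsPattern("ã");
const re = new RegExp(source);

console.log(source);                // (?:\u00E3|\u0061\u0303)
console.log(re.test("S\u00E3o"));   // true  (composed)
console.log(re.test("Sa\u0303o"));  // true  (decomposed)
console.log(re.test("Sao"));        // false (no tilde)
```

An alternative that avoids alternations altogether is to normalize the input first, for example with input.normalize("NFC"), and then match only the composed form.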
