
How can I combine a character followed by a "combining accent" into a single character?


I'm taking a phrase that the user enters into a web page and submitting it to a French-English dictionary. Sometimes the dictionary lookup would fail because there are two representations for most accented characters. For example:

  • é can be done in a single character: \xE9 (latin small letter e with acute).
  • But it can also be represented by two characters: e + \u0301 (combining acute accent).

I always want to submit the former (single character) to the dictionary.

Right now, I'm doing that by replacing every two-character occurrence I find with the equivalent single character. But is there a simpler (i.e. one-line) way to do this, either in JavaScript or in the browser when it's fetched from the input field?

function translate(phrase) {
    // Combine accents into a single accented character, if necessary.
    var TRANSFORM = [
        // Acute accent.
        [/E\u0301/g, "\xC9"], // É
        [/e\u0301/g, "\xE9"], // é

        // Grave accent.
        [/a\u0300/g, "\xE0"], // à
        [/e\u0300/g, "\xE8"], // è
        [/u\u0300/g, "\xF9"], // ù

        // Cedilla (no combining accent).

        // Circumflex.
        [/a\u0302/g, "\xE2"], // â
        [/e\u0302/g, "\xEA"], // ê
        [/i\u0302/g, "\xEE"], // î
        [/o\u0302/g, "\xF4"], // ô
        [/u\u0302/g, "\xFB"], // û

        // Trema.
        [/e\u0308/g, "\xEB"], // ë
        [/i\u0308/g, "\xEF"], // ï
        [/u\u0308/g, "\xFC"] // ü

        // oe ligature (no combining accent).
    ];
    for (var i = 0; i < TRANSFORM.length; i++)
        phrase = phrase.replace(TRANSFORM[i][0], TRANSFORM[i][1]);

    // Do translation.
    ...
}
Mike M. Lin asked May 05 '14 16:05


1 Answer

This is called Unicode normalization, and it looks like you want the NFC form (Normalization Form C, canonical composition):

Characters are decomposed and then recomposed by canonical equivalence.

In other words, it replaces each base character followed by combining marks with the equivalent single precomposed character, where one exists.

This is built into ECMAScript 6 as String.prototype.normalize, so if you only need to support newer browsers, you can just do the following:

phrase = phrase.normalize('NFC');
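As a quick sanity check, the sketch below runs a few decomposed sequences through NFC and compares the results with the precomposed characters. It assumes an environment where String.prototype.normalize is available. The combineAccents name and the SAMPLES list are illustrative, not part of the original code. Most pairs come from the question's table; the cedilla pair is an extra, since a combining cedilla (U+0327) does exist and NFC composes it as well.

```javascript
// Hypothetical helper: compose base letter + combining mark sequences (NFC).
function combineAccents(phrase) {
    return phrase.normalize("NFC");
}

// [decomposed sequence, expected precomposed character]
var SAMPLES = [
    ["E\u0301", "\xC9"], // É
    ["e\u0301", "\xE9"], // é
    ["a\u0300", "\xE0"], // à
    ["o\u0302", "\xF4"], // ô
    ["u\u0308", "\xFC"], // ü
    ["c\u0327", "\xE7"]  // ç (combining cedilla, not in the question's table)
];

var allMatch = SAMPLES.every(function (pair) {
    return combineAccents(pair[0]) === pair[1];
});
console.log(allMatch); // true

// Input that is already composed comes back unchanged,
// so the call can be applied to every phrase unconditionally.
console.log(combineAccents("caf\xE9") === "caf\xE9"); // true
```

Because normalization covers all canonically equivalent sequences, this should also handle accented letters that a hand-written replacement table might miss.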

To support older browsers as well, it looks like this library does what you want:
https://github.com/walling/unorm

Usage would be phrase = UNorm.nfc(phrase).
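
If both old and new browsers have to be supported, one possible arrangement is a small wrapper that prefers the built-in method and only falls back to a library function when it is missing. This is a sketch under assumptions: toNFC and its fallbackNfc parameter are hypothetical names, and the fallback is passed in by the caller rather than hard-coded.

```javascript
// Hypothetical wrapper: use the built-in normalize when present, otherwise
// call a caller-supplied NFC function (for example one from a polyfill library).
function toNFC(phrase, fallbackNfc) {
    if (typeof String.prototype.normalize === "function") {
        return phrase.normalize("NFC");
    }
    if (typeof fallbackNfc === "function") {
        return fallbackNfc(phrase);
    }
    // No normalizer available: return the input unchanged rather than throw.
    return phrase;
}

// Where normalize is built in, the fallback argument is never used.
console.log(toNFC("cafe\u0301") === "caf\xE9"); // true
```

With the library above loaded, the second argument could be function (s) { return UNorm.nfc(s); }, following the usage shown in this answer.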

Andrew Clark answered Sep 21 '22 00:09