Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Javascript toLowerCase strange behaviour

I have a small application that reads tweets and tries to match keywords and I noticed this strange behaviour with a particular string:

var text = "The Νіk​е D​un​k​ Ніgh ЅΒ 'Uglу Ѕwеаt​еr​' іѕ n​оw аvаіlаblе http://swoo.sh/IHVaTL";
var lowercase = text.toLowerCase()

Now the value of lowercase is:

the νіk​е d​un​k​ ніgh ѕβ 'uglу ѕwеаt​еr​' іѕ n​оw аvаіlаblе http://swoo.sh/ihvatl

So it seems like the string is in a weird format, I double checked some of the letters and found that:

text.charAt(4)
>"N"
text.charCodeAt(5)
>925
'N'.charCodeAt(0)
>78

So even if it looks like a normal N, the unicode associated to it corresponds to

0925 थ DEVANAGARI LETTER THA

according to the unicode chart

So I´m a bit puzzled about how this can happen, and if there is anyway to "convert" to the supposed real letter

like image 927
jasalguero Avatar asked Dec 15 '13 23:12

jasalguero


2 Answers

There is a python library called unidecode that I've used to solve this problem in python before, it basically "flattens" unicode into ascii.

A quick google reveals that a similar library is available for JavaScript.

like image 99
willy Avatar answered Oct 19 '22 02:10

willy


You can create a separate canvas with each Latin letter, upper case and lower case, to compare against. Each time you encounter a character that's not in the Latin-1 range, create a new canvas for it, and compare it against each Latin alphabet character using an image diff algorithm. Replace the non-Latin character with the closest match.

For example:

var latinize = (function () {
    var latinLetters = [],
        canvases = [],
        size = 16,
        halfSize = size >> 1;

    function makeCanvas(chr) {
        var canvas = document.createElement('canvas'),
            context = canvas.getContext('2d');

        canvas.width = size;
        canvas.height = size;
        context.textBaseline = 'middle';
        context.textAlign = 'center';
        context.font = (halfSize) + "px sans-serif";
        context.fillText(chr, halfSize, halfSize);

        return context;
    }

    function nextChar(chr) {
        return String.fromCharCode(chr.charCodeAt(0) + 1);
    }

    function setupRange(from, to) {
        for (var chr = from; chr <= to; chr = nextChar(chr)) {
            latinLetters.push(chr);
            canvases.push(makeCanvas(chr));
        }
    }

    function calcDistance(ctxA, ctxB) {
        var distance = 0,
            dataA = ctxA.getImageData(0, 0, size, size).data,
            dataB = ctxB.getImageData(0, 0, size, size).data;

        for (var i = dataA.length; i--;) {
            distance += Math.abs(dataA[i] - dataB[i]);
        }

        return distance;
    }

    setupRange('a', 'z');
    setupRange('A', 'Z');
    setupRange('', ''); // ignore blank characters

    return function (text) {
        var result = "",
            scores, canvas;

        for (var i = 0; i < text.length; i++) {
            if (text.charCodeAt(i) < 128) {
                result += text.charAt(i);
                continue;
            }
            scores = [];
            canvas = makeCanvas(text.charAt(i));
            for (var j = 0; j < canvases.length; j++) {
                scores.push({
                    glyph: latinLetters[j],
                    score: calcDistance(canvas, canvases[j])
                });
            }
            scores.sort(function (a, b) {
                return a.score - b.score;
            });
            result += scores[0].glyph;
        }

        return result;
    }
}());

This translates your test string to "the nike dunk high sb 'ugly sweater' is now available".

The alternative is to create a giant data structure mapping all of the look-alike characters to their Latin-1 equivalents, as the library in @willy's answer does. This is extremely heavy for "browser JavaScript", and probably not suitable for sending to the client, as you can see by looking at the source for that project.

http://jsfiddle.net/Ly5Lt/4/

like image 1
Dagg Nabbit Avatar answered Oct 19 '22 04:10

Dagg Nabbit