
How to determine whether a set of characters in JavaScript is UTF-8 or not?

This is sort of a variation of previously asked questions, but I still am unable to find an answer, so I'm trying to distill it to the core of the problem in hopes there is a solution.

I have a database in which, for historical reasons, certain text entries are not UTF-8. Most are, and all entries made in the last 3 years are, but some older entries are not.

It is important to find the non-UTF-8 characters so I can either avoid them or convert them to UTF-8 for some XML I'm trying to generate.

The server-side JavaScript I'm using has a ByteBuffer type, so I can treat any set of characters as individual bytes and examine them as needed, and do not need to use the String type, which I understand is problematic in this situation.

Is there any check of text I can do to determine if it is valid UTF-8 or not in this case?

I've been searching for a couple of months now (;_;) and still have not been able to find an answer. Yet there must be a way of doing it, because XML validators (like in the major browsers) are able to report "encoding errors" when they run across non-UTF-8 characters.

I would just like to know any algorithm for how that is done so I can try to do the same sort of test in JavaScript. Once I know which characters are bad I can convert them from ISO-8859-1 (for example) to UTF-8. I have methods for that.
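For reference, a conversion like that is mechanical: ISO-8859-1 code points equal their byte values, so bytes 0x00-0x7F are already valid UTF-8 and bytes 0x80-0xFF each become a two-byte sequence. A minimal sketch, with an illustrative function name and a plain array of byte values standing in for whatever byte access the environment provides:

/**
 * Convert an array of ISO-8859-1 byte values (0-255) to UTF-8 bytes.
 * Each ISO-8859-1 byte is also the Unicode code point, so bytes >= 0x80
 * simply become the two-byte UTF-8 encoding of that code point.
 * @param {number[]} latin1Bytes
 * @returns {number[]} UTF-8 encoded bytes
 */
function latin1ToUtf8(latin1Bytes) {
    var out = [];
    for (var i = 0; i < latin1Bytes.length; i++) {
        var b = latin1Bytes[i];
        if (b <= 0x7F) {
            out.push(b);                      // ASCII passes through unchanged
        } else {
            out.push(0xC0 | (b >> 6));        // lead byte: 110xxxxx
            out.push(0x80 | (b & 0x3F));      // continuation byte: 10xxxxxx
        }
    }
    return out;
}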

I just don't know how to figure out which characters are not UTF-8. Again, I understand that using the JavaScript String type is problematic in this situation, but I do have an alternative ByteBuffer type which can handle the text on a per-byte basis.
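For what it's worth, the check such validators perform boils down to walking the bytes: a lead byte announces how many continuation bytes follow, and each continuation byte must fall in the range 0x80-0xBF. Below is a minimal sketch of that kind of check, assuming the bytes are available as a plain array of integers (the function name and byte access are illustrative, not tied to any particular ByteBuffer API):

/**
 * Check whether an array of byte values (0-255) is well-formed UTF-8.
 * For brevity this skips a few of the stricter rules (overlong 3- and
 * 4-byte forms and the UTF-16 surrogate range), but it catches the
 * typical stray ISO-8859-1 bytes.
 * @param {number[]} bytes - the raw bytes of the text
 * @returns {boolean} true if every sequence is well-formed
 */
function isValidUtf8(bytes) {
    var i = 0;
    while (i < bytes.length) {
        var b = bytes[i];
        var extra;                                       // continuation bytes expected

        if (b <= 0x7F)                   { extra = 0; }  // ASCII
        else if (b >= 0xC2 && b <= 0xDF) { extra = 1; }  // 2-byte sequence
        else if (b >= 0xE0 && b <= 0xEF) { extra = 2; }  // 3-byte sequence
        else if (b >= 0xF0 && b <= 0xF4) { extra = 3; }  // 4-byte sequence
        else { return false; }                           // never a valid lead byte

        for (var j = 1; j <= extra; j++) {
            var c = bytes[i + j];
            if (c === undefined || c < 0x80 || c > 0xBF) {
                return false;                            // missing or bad continuation byte
            }
        }
        i += extra + 1;
    }
    return true;
}

A variant that returns the index of the first bad byte instead of false would point directly at the characters that need converting.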

Thanks for any specific tests people can suggest.

doug

1 Answer

I have the same situation and problem. All server-side JavaScript strings are 16-bit, but if I get JSON from an endpoint it can be UTF-8, ANSI (ASCII), UCS2_BE, or UCS2_LE. UTF-16 converts naturally to a 16-bit JavaScript string, and that's a problem, since variable-length character encoding will cause SQL injection errors in AWS. However, the server-side JavaScript I use does some bit shifting or padding for UTF-8 that results in a 16-bit JavaScript string starting with the UTF-8 BOM (character codes 239, 187, 191). That's good: since I don't have 8-bit strings in JavaScript, I just check whether the first 3 characters are those BOM values.

You may not have the same luck with the bit shifting, but the function below worked for me. I'm sure there is a nicer, faster, better solution, but this post has been up for 2 years with 715 views and not a single solution.

Anders

Just call it:

var bolResult = isEncoded(strJSON);

/**
 * @description Check whether a JSON string starts with the UTF-8 BOM
 * @param {string} strJSON - the JSON text to test
 * @returns {boolean} true if the first three character codes are 239, 187, 191
 */
function isEncoded(strJSON) {
    // A valid (BOM-prefixed) string starts with the character codes
    // 239, 187, 191 (the UTF-8 byte order mark, 0xEF 0xBB 0xBF).
    var intCharCode0 = strJSON.charCodeAt(0);   // 239
    var intCharCode1 = strJSON.charCodeAt(1);   // 187
    var intCharCode2 = strJSON.charCodeAt(2);   // 191

    return intCharCode0 === 239 && intCharCode1 === 187 && intCharCode2 === 191;
}