This is sort of a variation on previously asked questions, but I am still unable to find an answer, so I'm trying to distill it to the core of the problem in hopes there is a solution.
I have a database in which, for historical reasons, certain text entries are not UTF-8. Most are, and all entries made in the last three years are, but some older entries are not.
It is important to find the non-UTF-8 characters so I can either avoid them or convert them to UTF-8 for some XML I'm trying to generate.
The server-side JavaScript I'm using has a ByteBuffer type, so I can treat any set of characters as individual bytes and examine them as needed, and do not need to use the String type, which I understand is problematic in this situation.
Is there any check of text I can do to determine if it is valid UTF-8 or not in this case?
I've been searching for a couple of months now (;_;) and still have not been able to find an answer. Yet there must be a way of doing it, because XML validators (like in the major browsers) are able to report "encoding errors" when they run across non-UTF-8 characters.
I would just like to know any algorithm for how that is done so I can try to do the same sort of test in JavaScript. Once I know which characters are bad I can convert them from ISO-8859-1 (for example) to UTF-8. I have methods for that.
I just don't know how to figure out which characters are not UTF-8. Again, I understand that using the JavaScript String type is problematic in this situation, but I do have an alternative ByteBuffer type which can handle characters on a per-byte basis.
Thanks for any specific tests people can suggest.
doug
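The check that XML validators perform amounts to walking the bytes and verifying that every multi-byte sequence follows UTF-8's encoding rules (valid lead byte, the right number of continuation bytes, no overlong forms, no surrogates, nothing past U+10FFFF). Here is a minimal sketch of that algorithm in JavaScript, assuming the ByteBuffer can be read as an array-like of byte values 0–255; `isValidUtf8` is an illustrative name, not an existing API:

```javascript
// Return true if the given array of byte values (0-255) is well-formed UTF-8.
function isValidUtf8(bytes) {
  var i = 0;
  while (i < bytes.length) {
    var b = bytes[i];
    if (b <= 0x7f) { i += 1; continue; }             // single-byte ASCII
    var len, min;
    if ((b & 0xe0) === 0xc0)      { len = 2; min = 0x80; }     // 110xxxxx
    else if ((b & 0xf0) === 0xe0) { len = 3; min = 0x800; }    // 1110xxxx
    else if ((b & 0xf8) === 0xf0) { len = 4; min = 0x10000; }  // 11110xxx
    else return false;                               // stray continuation or invalid lead byte
    if (i + len > bytes.length) return false;        // truncated sequence at end of input
    var cp = b & (0xff >> (len + 1));                // payload bits of the lead byte
    for (var j = 1; j < len; j++) {
      var c = bytes[i + j];
      if ((c & 0xc0) !== 0x80) return false;         // not a 10xxxxxx continuation byte
      cp = (cp << 6) | (c & 0x3f);
    }
    if (cp < min) return false;                      // overlong encoding
    if (cp >= 0xd800 && cp <= 0xdfff) return false;  // UTF-16 surrogate, illegal in UTF-8
    if (cp > 0x10ffff) return false;                 // beyond the Unicode range
    i += len;
  }
  return true;
}
```

An ISO-8859-1 byte like 0xE9 ("é") fails this check because it is a lone byte above 0x7F with no continuation bytes, which is exactly how a validator distinguishes the old entries from genuine UTF-8.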
I have the same situation and problem. All server-side JavaScript strings are 16-bit, but if I get JSON from an endpoint it can be UTF-8, ANSI (ASCII), UCS2_BE, or UCS2_LE. UTF-16 is naturally converted nicely to a JavaScript 16-bit string, and that's a problem, since variable-length character encoding will cause SQL injection errors in AWS. The server-side JavaScript that I use will, however, do some bit shifting or padding for UTF-8 that results in a 16-bit JavaScript string starting with the BOM characters (codes 239, 187, 191). That's good: since I don't have 8-bit strings in JavaScript, I just check that the first 3 characters are those codes.
You may not have the same luck with the bit shifting, but the function below worked for me. I'm sure there is a nicer, faster, better solution, but this post has been out for 2 years with 715 views and not a single solution.
Anders
Just call it:
var bolResult = isEncoded(strJSON);
/**
* @description Check if a string starts with the UTF-8 BOM
* @param {string} strJSON The JSON string to check
* @returns {boolean} true/false
*/
function isEncoded(strJSON) {
  /*
   * A BOM-prefixed UTF-8 string starts with character
   * codes 239, 187, 191 before the JSON's opening {
   */
  var intCharCode0 = strJSON.charCodeAt(0); // 239
  var intCharCode1 = strJSON.charCodeAt(1); // 187
  var intCharCode2 = strJSON.charCodeAt(2); // 191
  return intCharCode0 === 239 && intCharCode1 === 187 && intCharCode2 === 191;
}
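Once the BOM is detected, a natural next step (my own addition, not part of the original post) is to strip those three characters before passing the string to JSON.parse, since many parsers reject a leading BOM; `stripBom` is an illustrative helper name:

```javascript
// Remove a leading UTF-8 BOM (character codes 239, 187, 191), if present.
function stripBom(strJSON) {
  if (strJSON.charCodeAt(0) === 239 &&
      strJSON.charCodeAt(1) === 187 &&
      strJSON.charCodeAt(2) === 191) {
    return strJSON.slice(3); // drop the three BOM characters
  }
  return strJSON; // no BOM: return unchanged
}
```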