 

Check if a JavaScript string is valid UTF-8

A user can copy and paste text into an HTML textarea, and the pasted text sometimes contains invalid UTF-8 characters, for example when it is copied from an RTF file that contains tabs.

How can I check whether a string is valid UTF-8?

eNddy asked Mar 30 '16 16:03




2 Answers

I think you misunderstand what "UTF-8 characters" means. UTF-8 is an encoding of Unicode, which can represent pretty much every character and glyph that has ever existed in recorded human history, so to that extent there are no "invalid" UTF-8 characters.

RTF is a formatting system which works independently of the underlying encoding system - you can use RTF with ASCII, UTF-8, UTF-16 and others. Textboxes in HTML only respect plain text, so any RTF formatting will be automatically stripped (unless you're using a "rich-edit" component, which I assume you're not).

The things you describe, such as whitespace characters (like tabs: \t), are represented in Unicode (and so in UTF-8). A string containing those characters is still "valid UTF-8"; it's just invalid as far as your business requirements are concerned.
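Strictly speaking, "valid UTF-8" is a property of a byte sequence, while a JavaScript string is a sequence of UTF-16 code units. The following is a minimal sketch of the two checks that are actually possible, assuming an environment that provides the standard TextDecoder API (modern browsers, Node.js). The helper names isValidUtf8Bytes and isWellFormedString are made up for illustration.

```javascript
// Hypothetical helper: true if the bytes form valid UTF-8.
// With { fatal: true }, decode() throws a TypeError on malformed input
// instead of silently substituting U+FFFD.
function isValidUtf8Bytes(bytes) {
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return true;
  } catch (e) {
    return false;
  }
}

// Hypothetical helper: true if the string has no unpaired surrogates,
// which means it can be encoded as UTF-8 without loss.
// encodeURIComponent() throws a URIError on a lone surrogate.
function isWellFormedString(str) {
  try {
    encodeURIComponent(str);
    return true;
  } catch (e) {
    return false;
  }
}
```

A pasted tab passes both checks: isWellFormedString('a\tb') is true, which matches the point that such text is valid and only unwanted.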

I suggest just stripping out unwanted characters using a regular expression that matches non-visible characters (from here: "Match non printable/non ascii characters and remove from text"):

textBoxContent = textBoxContent.replace(/[^\x20-\x7E]+/g, '');

The expression [^\x20-\x7E] matches any character NOT in the codepoint range 0x20 (32, a normal space character ' ') to 0x7E (126, the tilde '~' character); every other character is removed. Note that this also strips line breaks and all non-ASCII characters, such as accented letters.

Unicode's first 128 codepoints (0 to 127) are identical to ASCII and can be seen here: http://www.asciitable.com/
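As a rough illustration of what that expression removes in practice, here is a small sketch. The second function is an assumption added for comparison, not part of the linked answer: it strips only control characters (the C0 and C1 ranges plus DEL), for cases where accented or non-Latin text should survive. Both function names are hypothetical.

```javascript
// The expression from the answer: keeps only printable ASCII (0x20-0x7E).
function stripNonPrintableAscii(text) {
  return text.replace(/[^\x20-\x7E]+/g, '');
}

// Hypothetical alternative: remove only C0/C1 control characters and DEL,
// leaving printable non-ASCII characters in place.
function stripControlChars(text) {
  return text.replace(/[\u0000-\u001F\u007F-\u009F]+/g, '');
}

var pasted = 'caf\u00E9\tbar'; // "café", a tab, then "bar"
console.log(stripNonPrintableAscii(pasted)); // "cafbar"
console.log(stripControlChars(pasted));      // "cafébar"
```

Both variants also remove line breaks, so a multi-line textarea would need \n (and possibly \r) excluded from the character class.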

Dai answered Oct 30 '22 16:10


Just an idea:

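function checkUTF8(text) {
    var utf8Text = text;
    try {
        // escape() (a legacy function) writes non-ASCII code units below 0x100
        // as %XX; decodeURIComponent() then reads those %XX runs as UTF-8 bytes.
        utf8Text = decodeURIComponent(escape(text));
        // Success: text held raw UTF-8 bytes (one per character) and has now
        // been decoded. Plain ASCII passes through unchanged.
    } catch (e) {
        // URIError ("URI malformed"): text was not a run of UTF-8 bytes,
        // so it is returned as it came in.
    }
    return utf8Text;
}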
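Since escape() is deprecated, the same idea can be sketched with the standard TextDecoder API instead. This is only an approximation of checkUTF8 above, assuming TextDecoder is available (modern browsers, Node.js); the name decodeIfUtf8Bytes is hypothetical.

```javascript
// Hypothetical modern counterpart of checkUTF8().
function decodeIfUtf8Bytes(text) {
  // Each character must stand for exactly one byte (0x00-0xFF).
  for (var i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) > 0xFF) return text;
  }
  var bytes = Uint8Array.from(text, function (ch) {
    return ch.charCodeAt(0);
  });
  try {
    // fatal: throw on malformed UTF-8; ignoreBOM: keep a leading BOM in the output.
    return new TextDecoder('utf-8', { fatal: true, ignoreBOM: true }).decode(bytes);
  } catch (e) {
    return text; // not UTF-8 bytes, so leave the input unchanged
  }
}

console.log(decodeIfUtf8Bytes('caf\u00C3\u00A9')); // "café" (mis-decoded bytes repaired)
console.log(decodeIfUtf8Bytes('caf\u00E9'));       // "café" (already decoded, unchanged)
```

Like the original, this repairs text whose UTF-8 bytes were mistakenly read one byte per character; it does not tell you whether an ordinary string is "valid".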
Daniel Rodriguez answered Oct 30 '22 18:10