How to remove invalid UTF-8 characters from a JavaScript string?

I'd like to remove all invalid UTF-8 characters from a string in JavaScript. I've tried with this JavaScript:

strTest = strTest.replace(/([\x00-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3})|./g, "$1");

It seems that the UTF-8 validation regex described here (link removed) is more complete, so I adapted it in the same way:

strTest = strTest.replace(/([\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})|./g, "$1");

Both of these pieces of code seem to let valid UTF-8 through, but they filter out hardly any of the bad UTF-8 characters in my test data: UTF-8 decoder capability and stress test. The bad characters either come through unchanged or seem to have some of their bytes removed, creating a new, invalid character.

I'm not very familiar with the UTF-8 standard or with multibyte handling in JavaScript, so I'm not sure if I'm failing to represent proper UTF-8 in the regex or if I'm applying that regex improperly in JavaScript.

Edit: added global flag to my regex per Tomalak's comment - however this still isn't working for me. I'm abandoning doing this on the client side per bobince's comment.

879

asked Apr 19 '10 19:04

Matthew Sielski

2 Answers

I use this simple and sturdy approach:

function cleanString(input) {
    var output = "";
    for (var i = 0; i < input.length; i++) {
        // Keep only code units in the ASCII range 0-127.
        if (input.charCodeAt(i) <= 127) {
            output += input.charAt(i);
        }
    }
    return output;
}

Basically all you really want are the ASCII chars 0-127, so just rebuild the string char by char. If it's a good char, keep it; if not, ditch it. Note that this discards every non-ASCII character, including valid ones such as accented letters. It's pretty robust, and if sanitization is your goal, it's fast enough (in fact it's really fast).
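For comparison, the same ASCII-only filter can probably be written as a single regex replace. The sketch below uses a hypothetical helper name, `stripNonAscii`, which is not part of the answer above; it also shows what the approach costs, since valid non-ASCII text is removed along with anything unwanted.

```javascript
// Hypothetical helper: drops every UTF-16 code unit above 0x7F.
// Without the "u" flag the character class works on code units,
// so both halves of a surrogate pair (e.g. an emoji) are removed.
function stripNonAscii(input) {
  return input.replace(/[^\x00-\x7F]/g, "");
}

console.log(stripNonAscii("plain ASCII"));          // unchanged
console.log(stripNonAscii("na\u00EFve caf\u00E9")); // accented letters are lost
```

This is only suitable when the output genuinely has to be ASCII; it does not distinguish valid from invalid characters.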

189

answered Sep 29 '22 08:09

Ali

JavaScript strings are natively Unicode. They hold character sequences*, not byte sequences, so it is impossible for one to contain an invalid byte sequence.

(* Technically, they actually contain UTF-16 code unit sequences, which is not quite the same thing, but this probably isn't anything you need to worry about right now.)
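The one way a string of UTF-16 code units can be ill-formed is by containing an unpaired surrogate. If that case ever matters, a minimal sketch of removing such code units might look like the following; `stripLoneSurrogates` is a hypothetical name, not a built-in function.

```javascript
// Hypothetical helper: removes unpaired surrogate code units,
// leaving well-formed surrogate pairs and all other characters intact.
function stripLoneSurrogates(str) {
  let out = "";
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: keep it only if a low surrogate follows.
      const next = str.charCodeAt(i + 1); // NaN at end of string
      if (next >= 0xDC00 && next <= 0xDFFF) {
        out += str[i] + str[i + 1];
        i++; // skip the low surrogate just copied
      }
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      // Low surrogate with no preceding high surrogate: drop it.
    } else {
      out += str[i];
    }
  }
  return out;
}

console.log(stripLoneSurrogates("a\uD800b") === "ab"); // true
```

Well-formed pairs such as "\uD83D\uDE00" pass through unchanged.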

You can, if you need to for some reason, create a string holding characters used as placeholders for bytes, i.e., using the character U+0080 ('\x80') to stand for the byte 0x80. This is what you would get if you encoded characters to bytes using UTF-8, then decoded them back to characters using ISO-8859-1 by mistake. There is a special JavaScript idiom for this:

var bytelike= unescape(encodeURIComponent(characters));

and to get back from UTF-8 pseudobytes to characters again:

var characters= decodeURIComponent(escape(bytelike));

(This is, notably, pretty much the only time the escape/unescape functions should ever be used. Their existence in any other program is almost always a bug.)
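As a small illustration of the round trip, here is a sketch that wraps the two idioms in helper functions. The names `toBytelike` and `fromBytelike` are hypothetical and chosen only for this example.

```javascript
// Characters -> UTF-8 "pseudobytes": one code unit (0x00-0xFF) per byte.
function toBytelike(characters) {
  return unescape(encodeURIComponent(characters));
}

// UTF-8 "pseudobytes" -> characters. Throws if the input is not valid UTF-8.
function fromBytelike(bytelike) {
  return decodeURIComponent(escape(bytelike));
}

const bytelike = toBytelike("\u00E9");            // U+00E9, e with acute accent
console.log(bytelike.length);                     // 2
console.log(bytelike === "\u00C3\u00A9");         // true: bytes 0xC3 0xA9
console.log(fromBytelike(bytelike) === "\u00E9"); // true
```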

decodeURIComponent(escape(bytelike)), since it behaves like a UTF-8 decoder, will raise an error (a URIError) if the sequence of code units fed into it would not be acceptable as UTF-8 bytes.
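If you really do have such a byte-like string, that error can be used as a validity check. The sketch below also shows one possible way to drop invalid sequences instead of failing, using `TextDecoder`, which is a swapped-in technique and not something this answer proposes. The helper names are hypothetical, and the sketch assumes every code unit in the input is in the range 0x00-0xFF.

```javascript
// Hypothetical helper: true if the byte-like string is valid UTF-8.
function isValidUtf8(bytelike) {
  try {
    decodeURIComponent(escape(bytelike));
    return true;
  } catch (e) {
    if (e instanceof URIError) return false;
    throw e; // anything else is unexpected
  }
}

// Hypothetical helper: decodes a byte-like string and drops invalid sequences.
// TextDecoder substitutes U+FFFD for each invalid sequence; those are then
// removed. Caveats: a legitimately encoded U+FFFD is removed as well, and a
// leading byte order mark is stripped by the decoder.
function decodeDroppingInvalid(bytelike) {
  const bytes = new Uint8Array(bytelike.length);
  for (let i = 0; i < bytelike.length; i++) {
    const unit = bytelike.charCodeAt(i);
    if (unit > 0xFF) throw new RangeError("not a byte-like string");
    bytes[i] = unit;
  }
  return new TextDecoder("utf-8").decode(bytes).replace(/\uFFFD/g, "");
}

console.log(isValidUtf8("\u00C3\u00A9")); // true
console.log(isValidUtf8("\u00FF"));       // false
console.log(decodeDroppingInvalid("a\u00C3\u00A9\u00FFb") === "a\u00E9b"); // true
```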

It is very rare for you to need to work on byte strings like this in JavaScript. Better to keep working natively in Unicode on the client side. The browser will take care of UTF-8-encoding the string on the wire (in a form submission or XMLHttpRequest).
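To see what the UTF-8 encoding of a native string looks like without building byte-like strings, the standard `TextEncoder` API can be used. This is only an illustrative sketch; the variable names are arbitrary.

```javascript
// TextEncoder always produces UTF-8 bytes from a native string.
const encoder = new TextEncoder();

const bytes = Array.from(encoder.encode("\u00E9")); // U+00E9
console.log(bytes); // [ 195, 169 ], i.e. 0xC3 0xA9

// An unpaired surrogate cannot be encoded as UTF-8, so the encoder
// substitutes U+FFFD (bytes 0xEF 0xBF 0xBD) rather than emit invalid output.
const replaced = Array.from(encoder.encode("\uD800"));
console.log(replaced); // [ 239, 191, 189 ]
```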

answered Sep 29 '22 09:09

bobince

Tags: javascript, regex, utf-8
