Remove Unicode characters within various ranges in javascript

I'm trying to remove every Unicode character in a string if it falls in any of the ranges below.

\uD800-\uDFFF
\u1D800-\u1DFFF
\u2D800-\u2DFFF
\u3D800-\u3DFFF
\u4D800-\u4DFFF
\u5D800-\u5DFFF
\u6D800-\u6DFFF
\u7D800-\u7DFFF
\u8D800-\u8DFFF
\u9D800-\u9DFFF
\uAD800-\uADFFF
\uBD800-\uBDFFF
\uCD800-\uCDFFF
\uDD800-\uDDFFF
\uED800-\uEDFFF
\uFD800-\uFDFFF
\u10D800-\u10DFFF

As an initial prototype, I tried to just remove characters within the first range by using a regex in the replace function.

var buffer = "he\udfffllo world";
var output = buffer.replace(/[\ud800-\udfff]/g, "");
d.innerText = buffer + " is replaced with " + output;

In this case, the character seems to have been replaced fine.

However, when I replace that with

var buffer = "he\udfffllo worl\u1dfffd";
var output = buffer.replace(/[\ud800-\udfff\u1d800-\u1dfff]/g, "");
d.innerText = buffer + " is replaced with " + output;

I see something unexpected. My output shows up as:

he�llo worl᷿fd is replaced with

There are two things to note here:

  1. \u1dfff does not show up as one character - \u1dff gets converted to a character and the f at the end is treated as its own character
  2. the result is an empty string.
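Both observations are consistent with one rule: in a JavaScript string or regex literal (without the u flag), \u consumes exactly four hex digits. A minimal sketch of what that implies for the code above, assuming a standard ES5+ engine: "\u1dfff" is \u1dff followed by a literal f, and the character class is read as \ud800-\udfff, \u1d80, the range 0-\u1dff, and f. That range covers digits and ASCII letters, so nearly everything is removed; only the space survives, which is easy to mistake for an empty string.

```javascript
// "\u1dfff" is parsed as "\u1dff" followed by the letter "f".
const s = "\u1dfff";
console.log(s.length);             // 2
console.log(s === "\u1dff" + "f"); // true

// The class is parsed as: \ud800-\udfff, \u1d80, the range "0"-\u1dff, "f".
const re = /[\ud800-\udfff\u1d800-\u1dfff]/g;
const buffer = "he\udfffllo worl\u1dfffd";
const output = buffer.replace(re, "");

// Every letter falls inside "0"-\u1dff; the space (U+0020) does not.
console.log(JSON.stringify(output)); // " "
```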

Any suggestions on how I can accomplish this would be much appreciated.


EDIT

My overall goal is to filter out all characters that the encodeURIComponent function considers invalid. I ran some tests and found the list above to be the set of characters that are invalid. For instance, the code below, which first converts 1dfff to a Unicode character before passing it to encodeURIComponent, causes the latter function to raise an exception.

var v = String.fromCharCode(122879);
var uriComponent = encodeURIComponent(v);
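The exception in question is a URIError, which encodeURIComponent throws when the string contains a surrogate that is not part of a valid high/low pair. A small sketch for probing which strings are rejected; canEncode is a hypothetical helper name, not a built-in:

```javascript
// Hypothetical helper: reports whether encodeURIComponent accepts a string.
function canEncode(str) {
  try {
    encodeURIComponent(str);
    return true;
  } catch (e) {
    // Lone surrogates cause a URIError; rethrow anything else.
    if (e instanceof URIError) return false;
    throw e;
  }
}

console.log(canEncode("hello"));        // true
console.log(canEncode("\udfff"));       // false: lone low surrogate
console.log(canEncode("\ud800"));       // false: lone high surrogate
console.log(canEncode("\ud83d\ude00")); // true: valid surrogate pair (U+1F600)
```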

I edited parts of the question after @Blender pointed out that I was using x instead of u in my code to represent Unicode characters.


EDIT 2

I investigated my technique for finding the "invalid" Unicode ranges further, and as it turns out, if you give String.fromCharCode a number that needs more than 16 bits, it only looks at the lowest 16 bits of the number. That explains the pattern I was seeing. So I only need to worry about the first range.
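The truncation can be checked directly. The sketch below also uses String.fromCodePoint, which assumes an ES2015+ engine, to show that the real code point U+1DFFF is a different, encodable character:

```javascript
// fromCharCode keeps only the low 16 bits: 0x1dfff becomes 0xdfff.
const viaCharCode = String.fromCharCode(0x1dfff);
console.log(viaCharCode.length);                     // 1
console.log(viaCharCode === "\udfff");               // true
console.log(viaCharCode.charCodeAt(0).toString(16)); // "dfff"

// fromCodePoint builds the actual code point as a surrogate pair.
const viaCodePoint = String.fromCodePoint(0x1dfff);
console.log(viaCodePoint.length);                    // 2
console.log(encodeURIComponent(viaCodePoint));       // "%F0%9D%BF%BF"
```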

K Mehta asked Oct 03 '22 23:10


1 Answer

It seems you're trying to remove Unicode surrogate code units from the string. However, only U+D800 through U+DFFF are surrogate code points; the remaining values you name are not, and could be allocated to valid Unicode characters. In that case, the following will suffice (use \u rather than \x to refer to Unicode characters):

buffer.replace(/[\ud800-\udfff]/g, "");
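One caveat, offered as a sketch rather than as part of the original answer: without the u flag, the pattern above removes every surrogate code unit, including the two halves of valid pairs such as emoji, which encodeURIComponent accepts. If only unpaired surrogates should go, adding the u flag (ES2015+) makes the engine match by code point, so a valid pair is never matched. stripLoneSurrogates is a hypothetical helper name.

```javascript
// Hypothetical helper: removes unpaired surrogates, keeps valid pairs.
function stripLoneSurrogates(str) {
  // With the u flag a valid high+low pair counts as one astral code
  // point, which lies outside the range U+D800..U+DFFF.
  return str.replace(/[\ud800-\udfff]/gu, "");
}

const input = "he\udfffllo \ud83d\ude00 worl\ud800d";
const cleaned = stripLoneSurrogates(input);
console.log(cleaned === "hello \ud83d\ude00 world"); // true
console.log(encodeURIComponent(cleaned)); // "hello%20%F0%9F%98%80%20world"

// For comparison, the pattern without the u flag also drops the emoji.
const stripped = input.replace(/[\ud800-\udfff]/g, "");
console.log(stripped === "hello  world"); // true
```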
Peter O. answered Oct 13 '22 11:10