
JavaScript regex whitespace characters

I have done some searching, but I couldn't find a definitive list of whitespace characters included in the \s in JavaScript's regex.

I know that I can rely on space, line feed, carriage return, and tab as being whitespace, but I thought that since JavaScript was traditionally only for the browser, maybe URL encoded whitespace and things like &nbsp; and %20 would be supported as well.

What exactly does JavaScript's regex compiler consider whitespace? If there are differences between browsers, I only really care about WebKit browsers, but it would be nice to know of any differences. Also, what about Node.js?

beatgammit asked May 20 '11 14:05


1 Answer

Here's an expansion of primvdb's answer, covering the entire 16-bit space, including Unicode code point values and a comparison with str.trim(). I tried to edit that answer to improve it, but my edit was rejected, so I posted this new one.

Identify every character in the 16-bit range (U+0000 through U+FFFF) that is matched as whitespace by the regex \s or stripped by String.prototype.trim():

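// Code points matched by \s and by trim(), stored as [decimal, hex] pairs.
const regexList = [];
const trimList = [];

// Scan every 16-bit value (U+0000 through U+FFFF).
for (let codePoint = 0; codePoint < 2 ** 16; codePoint += 1) {
  const str = String.fromCodePoint(codePoint);
  const unicode = codePoint.toString(16).padStart(4, '0');

  // A one-character string that becomes empty consisted only of whitespace.
  if (str.replace(/\s/, '') === '') regexList.push([codePoint, unicode]);
  if (str.trim() === '') trimList.push([codePoint, unicode]);
}

// True when \s and trim() agree on every value scanned.
const identical = JSON.stringify(regexList) === JSON.stringify(trimList);
// One "hex decimal" line per matching code point.
const list = regexList.reduce((str, [codePoint, unicode]) => `${str}${unicode} ${codePoint}\n`, '');

console.log({identical});
console.log(list);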
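
The loop above stops at 2 ** 16, so it says nothing about the supplementary planes (U+10000 through U+10FFFF). As a separate check, here is a minimal sketch of the same idea over that range. The name astralWhitespace is only illustrative, and the u flag is my addition so that \s tests a whole code point rather than each surrogate half. I would not expect it to find anything, since every whitespace character JavaScript recognises sits in the 16-bit range.

```javascript
const astralWhitespace = [];

// Supplementary planes: U+10000 through U+10FFFF.
for (let codePoint = 0x10000; codePoint <= 0x10ffff; codePoint += 1) {
  const str = String.fromCodePoint(codePoint);

  // With the u flag, \s is tested against the whole code point.
  if (/^\s$/u.test(str) || str.trim() === '') {
    astralWhitespace.push(codePoint.toString(16));
  }
}

// Number of supplementary-plane code points treated as whitespace.
console.log(astralWhitespace.length);
```

This sketch is independent of the script above, whose output is the list that follows.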

The list from the 16-bit scan (in V8), as hexadecimal code point followed by decimal value:

0009 9
000a 10
000b 11
000c 12
000d 13
0020 32
00a0 160
1680 5760
2000 8192
2001 8193
2002 8194
2003 8195
2004 8196
2005 8197
2006 8198
2007 8199
2008 8200
2009 8201
200a 8202
2028 8232
2029 8233
202f 8239
205f 8287
3000 12288
feff 65279
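
On the question's &nbsp; and %20: \s only looks at the characters actually present in the string, so encoded forms are not whitespace until something decodes them. Here is a small sketch of that. The samples and results names are just for illustration, and decodeURIComponent is the standard built-in.

```javascript
// Each sample is tested exactly as written; \s does no decoding of its own.
const samples = {
  urlEncoded: '%20',                     // three characters: %, 2, 0
  urlDecoded: decodeURIComponent('%20'), // one character: U+0020
  htmlEntity: '&nbsp;',                  // six characters of markup text
  noBreakSpace: '\u00a0',                // the character that entity stands for
};

// Map each sample name to whether \s finds a match anywhere in the string.
const results = Object.fromEntries(
  Object.entries(samples).map(([name, value]) => [name, /\s/.test(value)])
);

console.log(results);
// { urlEncoded: false, urlDecoded: true, htmlEntity: false, noBreakSpace: true }
```

So if the input may still be URL encoded or contain HTML entities, decode it before applying \s or trim().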
jsejcksn answered Sep 23 '22 04:09