
Regex to match 4-byte Unicode characters

I have a regex that finds all 4-byte Unicode characters in a string. I would like to make it compatible with all popular browsers.

The following code works fine in Chrome and Firefox, but Safari throws "Invalid regular expression: range out of order in character class":

var match = 'aaa😚aaa'.match(/[\u{10000}-\u{10FFFF}]/gu);

So my question is: how should I change the regexp so that it matches all 4-byte Unicode characters in a string without relying on the regex Unicode (u) flag?

Roland Soós asked Mar 21 '17 09:03




1 Answer

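Safari does not support the ES6 regular expression syntax used here (the u flag and \u{...} escapes). All you can do is transpile the regex to ES5 syntax, where each character above U+FFFF is matched as a UTF-16 surrogate pair, that is, a high surrogate (U+D800 to U+DBFF) followed by a low surrogate (U+DC00 to U+DFFF):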

console.log('aaa😚aaa'.match(/(?:[\uD800-\uDBFF][\uDC00-\uDFFF])/g));
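To illustrate why the surrogate-pair pattern is equivalent to the \u{10000}-\u{10FFFF} range, here is a minimal ES5-only sketch. The helper names findAstralChars and toCodePoint are hypothetical, introduced only for this example; the arithmetic is the standard UTF-16 decoding of a surrogate pair.

```javascript
// ES5-compatible pattern: one high surrogate followed by one low surrogate.
// Every code point from U+10000 to U+10FFFF is stored this way in a JS string.
var ASTRAL_PAIR = /[\uD800-\uDBFF][\uDC00-\uDFFF]/g;

// Hypothetical helper: returns all characters above U+FFFF found in `text`.
function findAstralChars(text) {
  return text.match(ASTRAL_PAIR) || [];
}

// Hypothetical helper: recombines a surrogate pair into its code point.
function toCodePoint(pair) {
  var high = pair.charCodeAt(0);
  var low = pair.charCodeAt(1);
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

var found = findAstralChars('aaa😚aaa');
console.log(found.length);                        // 1
console.log(toCodePoint(found[0]).toString(16));  // 1f61a
```

Note that this pattern only matches well-formed pairs; a lone (unpaired) surrogate in the input is not reported.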
Wiktor Stribiżew answered Nov 07 '22 12:11