Consider the following Unicode-heavy regular expression (emoji standing in for non-ASCII and extra-BMP characters):
'🍤🍦🍋🍋🍦🍤'.match(/🍤|🍦|🍋/ug)
Firefox returns [ "🍤", "🍦", "🍋", "🍋", "🍦", "🍤" ] 🤗. Chrome 52.0.2743.116 and Node 6.4.0 both return null! It doesn't seem to matter whether I put the string in a variable and call str.match(…), nor whether I build a RegExp object via new RegExp('🍤|🍦|🍋', 'gu').
(Chrome is fine with ORing just two sequences: '🍤🍦🍋🍋🍦🍤'.match(/🍤|🍦/ug) works. It's also fine with non-Unicode alternatives: 'aakkzzkkaa'.match(/aa|kk|zz/ug) works.)
Am I doing something wrong? Is this a Chrome bug? The ECMAScript compatibility table says I should be ok with Unicode regexps.
(PS: The three emoji used in this example are just stand-ins. In my application they'll be arbitrary but distinct strings. But I wonder whether the fact that '🍤🍦🍋🍋🍦🍤'.match(/[🍤🍦🍋]/ug) works in Chrome is relevant?)
Update: Marked fixed on 12 April 2017 in Chromium and downstream projects (including Chrome and Node).
Without the u flag, your regexp works, and this is no wonder: in BMP (no "u") mode it compares 16-bit code units to 16-bit code units, that is, a surrogate pair to another surrogate pair. The behaviour in "u" mode (which is supposed to compare code points, not code units) does indeed look like a Chrome bug. In the meantime, you can enclose each alternative in a group, which seems to work fine:
m = '🍤🍦🍋🍋🍦🍤'.match(/(🍤)|(🍦)|(🍋)/ug)
console.log(m)
// note that the groups must be capturing!
// this doesn't work:
m = '🍤🍦🍋🍋🍦🍤'.match(/(?:🍤)|(?:🍦)|(?:🍋)/ug)
console.log(m)
And here's a quick proof that more than two SMP alternatives are broken in u mode:
// pick whatever range you like
// from https://en.wikipedia.org/wiki/Plane_(Unicode)#Supplementary_Multilingual_Plane
var range = '11300-1137F';
range = range.split('-').map(x => parseInt(x, 16));
var chars = [];
for (var i = range[0]; i <= range[1]; i++) {
    chars.push(String.fromCodePoint(i));
}
var str = chars.join('');
while (chars.length) {
    var re = new RegExp(chars.join('|'), 'u');
    if (str.match(re))
        console.log(chars.length, re);
    chars.pop();
}
In Chrome, it only logs the last two regexes (2 and 1 alts).
without the "u" flag it also works in Chrome (52.0.2743.116) for me
well, the u flag seems to be broken, and even without it a multiplier breaks things: '🍤🍤🍦🍦🍦🍦🍋🍋🍋🍋🍦🍦🍦🍦🍤🍤'.match(/🍤|🍦{2}|🍋/g) -> null. {1} and {1,} seem to work; I assume they are translated into ? and +. I assume that without the "u" flag, 🍦{2} is interpreted as \ud83c\udf66{2}, which would explain the behaviour.
just tested with (?:🍦){2} — this seems to work correctly. I guess that confirms my assumption about the multiplier.
here's a quick fix for that:
// a utility I usually have in my code
var replace = (pattern, replacement) => value => String(value).replace(pattern, replacement);

var fixRegexSource = replace(
    /[\ud800-\udbff][\udc00-\udfff]/g,
    // "(?:$&)" — not sure whether this might still be buggy,
    // so I convert each surrogate pair into explicit \uXXXX escapes,
    // which can't be misinterpreted
    c => `(?:\\u${c.charCodeAt(0).toString(16)}\\u${c.charCodeAt(1).toString(16)})`
);

var fixRegex = regex => new RegExp(
    fixRegexSource(regex.source),
    regex.flags.replace("u", "")
);
sorry, I didn't come up with better function names