
Sequence of logical OR in ES6/Unicode regular expression in Chrome ✗ vs Firefox ✓

Consider the following Unicode-heavy regular expression (emoji standing in for non-ASCII and extra-BMP characters):

'🍤🍦🍋🍋🍦🍤'.match(/🍤|🍦|🍋/ug)

Firefox returns [ "🍤", "🍦", "🍋", "🍋", "🍦", "🍤" ] 🤗.

Chrome 52.0.2743.116 and Node 6.4.0 both return null! It makes no difference whether I put the string in a variable and do str.match(…), or build a RegExp object via new RegExp('🍤|🍦|🍋', 'gu').

(Chrome is ok with just ORing two sequences: '🍤🍦🍋🍋🍦🍤'.match(/🍤|🍦/ug) is ok. It’s also ok with non-Unicode: 'aakkzzkkaa'.match(/aa|kk|zz/ug) works.)

Am I doing something wrong? Is this a Chrome bug? The ECMAScript compatibility table says I should be ok with Unicode regexps.

(PS: The three emoji used in this example are just stand-ins. In my application, they’ll be arbitrary but distinct strings. But I wonder if the fact that '🍤🍦🍋🍋🍦🍤'.match(/[🍤🍦🍋]/ug) works in Chrome is relevant?)


Update: The bug was marked fixed on 12 April 2017 in Chromium and downstream (including Chrome and Node).

Ahmed Fasih asked Aug 25 '16 18:08



2 Answers

Without the u flag, your regexp works, which is no surprise: in the non-Unicode mode (no "u") the engine compares 16-bit code units with 16-bit code units, so a surrogate pair in the pattern is simply matched against a surrogate pair in the string.
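To make the difference between the two modes concrete, here is a small sketch (my own illustration, with made-up variable names) of how a single astral character looks to the engine with and without "u":

```javascript
// Illustration only: one astral character seen as code units vs. code points.
const shrimp = '🍤';

// In UTF-16 the character is stored as two 16-bit code units (a surrogate pair).
const unitCount = shrimp.length;        // .length counts code units
const pointCount = [...shrimp].length;  // string iteration walks code points

// Without "u", "." consumes one code unit; with "u", it consumes one code point.
const dotMatchesWithoutU = /^.$/.test(shrimp);
const dotMatchesWithU = /^.$/u.test(shrimp);

console.log(unitCount, pointCount, dotMatchesWithoutU, dotMatchesWithU);
// → 2 1 false true
```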

The behaviour in the "u" mode (which is supposed to compare code points, not code units) does indeed look like a Chrome bug. In the meantime, you can enclose each alternative in a group, which seems to work fine:

var m = '🍤🍦🍋🍋🍦🍤'.match(/(🍤)|(🍦)|(🍋)/ug);
console.log(m);

// note that the groups must be capturing!
// this doesn't work in the affected Chrome versions:

m = '🍤🍦🍋🍋🍦🍤'.match(/(?:🍤)|(?:🍦)|(?:🍋)/ug);
console.log(m);
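Since the question mentions that the real alternatives are arbitrary but distinct strings, the grouping workaround could be applied programmatically. The following is only a sketch: escapeRegExp and alternationOf are hypothetical helper names of mine, not anything built in.

```javascript
// Hypothetical helpers: escape each literal string, wrap it in a capturing
// group (the workaround above), and join the groups with "|".
// Only regex syntax characters are escaped, which keeps the escapes valid in "u" mode.
const escapeRegExp = s => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

const alternationOf = (strings, flags = 'gu') =>
    new RegExp(strings.map(s => `(${escapeRegExp(s)})`).join('|'), flags);

const re = alternationOf(['🍤', '🍦', '🍋']);
const found = '🍤🍦🍋🍋🍦🍤'.match(re);
console.log(re.source, found);
```

If one of the strings can be a prefix of another, it is probably wise to sort the longer strings first, because alternation takes the first alternative that matches.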

And here's a quick proof that more than two SMP alternatives are broken in the u mode:

// pick any range from the Supplementary Multilingual Plane, see
// https://en.wikipedia.org/wiki/Plane_(Unicode)#Supplementary_Multilingual_Plane
var range = '11300-1137F';

range = range.split('-').map(x => parseInt(x, 16))

var chars = [];
for (var i = range[0]; i <= range[1]; i++) {
    chars.push(String.fromCodePoint(i))
}

var str = chars.join('');

while(chars.length) {
    var re = new RegExp(chars.join('|'), 'u')
    if(str.match(re))
        console.log(chars.length, re);
    chars.pop();
}

In Chrome, it only logs the last two regexes (those with 2 and 1 alternatives).

georg answered Oct 31 '22 18:10


Without the "u" flag it also works in Chrome (52.0.2743.116) for me.

So the u flag seems to be broken.

Without "u" it works unless you use a quantifier: '🍤🍤🍦🍦🍦🍦🍋🍋🍋🍋🍦🍦🍦🍦🍤🍤'.match(/🍤|🍦{2}|🍋/g) returns null. {1} and {1,} seem to work; I assume they are simplified away ({1} is a no-op and {1,} is equivalent to +). I assume that without the "u" flag 🍦{2} is interpreted as \ud83c\udf66{2}, so the quantifier applies only to the low surrogate, which would explain the behaviour.

I just tested (?:🍦){2}, and it seems to work correctly. I guess this confirms my assumption about the quantifier.
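The quantifier behaviour can be checked in isolation. This is just a sketch with my own variable names; it shows that, without "u", {2} binds to the trailing low surrogate unless the emoji is grouped:

```javascript
const ice = '🍦';
const twoIce = ice + ice;

// Without "u", {2} applies only to the low surrogate of the emoji,
// so two complete emoji in a row do not match.
const bareWithoutU = /^🍦{2}$/.test(twoIce);

// What the pattern does accept: one high surrogate followed by two low surrogates.
const malformed = ice[0] + ice[1] + ice[1];
const bareMatchesMalformed = /^🍦{2}$/.test(malformed);

// Grouping makes the quantifier apply to the whole surrogate pair.
const groupedWithoutU = /^(?:🍦){2}$/.test(twoIce);

console.log(bareWithoutU, bareMatchesMalformed, groupedWithoutU);
// → false true true
```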

Here is a quick fix for that:

// a utility I usually keep in my code
var replace = (pattern, replacement) => value => String(value).replace(pattern, replacement);

var fixRegexSource = replace(
    /[\ud800-\udbff][\udc00-\udfff]/g, 
    // "(?:$&)" would be simpler, but I'm not sure whether it might still be buggy;
    // that's why I convert each pair into the \uXXXX escape syntax,
    // which can't be misinterpreted
    c => `(?:\\u${c.charCodeAt(0).toString(16)}\\u${c.charCodeAt(1).toString(16)})`
);

var fixRegex = regex => new RegExp(
    fixRegexSource(regex.source), 
    regex.flags.replace("u", "")
);

Sorry, I couldn't come up with better function names.
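A possible usage, as a sketch: the helpers are restated compactly so the snippet runs on its own, and the variable names fixed and matches are mine.

```javascript
// Same helpers as above, repeated so this snippet is self-contained.
const replace = (pattern, replacement) => value => String(value).replace(pattern, replacement);
const fixRegexSource = replace(
    /[\ud800-\udbff][\udc00-\udfff]/g,
    c => `(?:\\u${c.charCodeAt(0).toString(16)}\\u${c.charCodeAt(1).toString(16)})`
);
const fixRegex = regex => new RegExp(fixRegexSource(regex.source), regex.flags.replace('u', ''));

// Every surrogate pair in the source becomes a non-capturing group of
// \uXXXX escapes, and the "u" flag is dropped.
const fixed = fixRegex(/🍤|🍦{2}|🍋/gu);
const input = '🍤🍤🍦🍦🍦🍦🍋🍋🍋🍋🍦🍦🍦🍦🍤🍤';
const matches = input.match(fixed);

console.log(fixed.source, fixed.flags, matches);
```

Note that this sketch would also rewrite astral characters inside a character class such as [🍤🍦🍋], where group syntax has no meaning, so it is probably only safe for patterns without such classes.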

Thomas answered Oct 31 '22 16:10