Consider the following Unicode-heavy regular expression (emoji standing in for non-ASCII and extra-BMP characters):
'🍤🍦🍋🍋🍦🍤'.match(/🍤|🍦|🍋/ug)
Firefox returns [ "🍤", "🍦", "🍋", "🍋", "🍦", "🍤" ] 🤗. Chrome 52.0.2743.116 and Node 6.4.0 both return null! It doesn't seem to matter whether I put the string in a variable and call str.match(…), nor whether I build a RegExp object via new RegExp('🍤|🍦|🍋', 'gu').
(Chrome is fine with ORing just two sequences: '🍤🍦🍋🍋🍦🍤'.match(/🍤|🍦/ug) works. It's also fine with non-Unicode alternatives: 'aakkzzkkaa'.match(/aa|kk|zz/ug) works.)
Am I doing something wrong? Is this a Chrome bug? The ECMAScript compatibility table says I should be ok with Unicode regexps.
(PS: The three emoji used in this example are just stand-ins. In my application they'll be arbitrary but distinct strings. But I wonder whether the fact that '🍤🍦🍋🍋🍦🍤'.match(/[🍤🍦🍋]/ug) works in Chrome is relevant?)
Update: Marked fixed on 12 April 2017 in Chromium and downstream projects (including Chrome and Node).
Without the u flag, your regexp works, and this is no wonder: in BMP (no "u") mode it compares 16-bit code units to 16-bit code units, that is, a surrogate pair to another surrogate pair. The behaviour in "u" mode (which is supposed to compare code points, not code units) does indeed look like a Chrome bug. In the meantime, you can enclose each alternative in a group, which seems to work fine:
m = '🍤🍦🍋🍋🍦🍤'.match(/(🍤)|(🍦)|(🍋)/ug)
console.log(m)
// note that the groups must be capturing!
// this doesn't work:
m = '🍤🍦🍋🍋🍦🍤'.match(/(?:🍤)|(?:🍦)|(?:🍋)/ug)
console.log(m)
And here's a quick proof that more than two SMP alternatives are broken in u mode:
// pick whatever range you like
// from https://en.wikipedia.org/wiki/Plane_(Unicode)#Supplementary_Multilingual_Plane
var range = '11300-1137F';
range = range.split('-').map(x => parseInt(x, 16));
var chars = [];
for (var i = range[0]; i <= range[1]; i++) {
    chars.push(String.fromCodePoint(i));
}
var str = chars.join('');
while (chars.length) {
    var re = new RegExp(chars.join('|'), 'u');
    if (str.match(re))
        console.log(chars.length, re);
    chars.pop();
}
In Chrome, it only logs the last two regexes (2 and 1 alts).
without the "u" flag it also works in Chrome (52.0.2743.116) for me
well, the u flag seems to be broken, and even without it a multiplier breaks things: '🍤🍤🍦🍦🍦🍦🍋🍋🍋🍋🍦🍦🍦🍦🍤🍤'.match(/🍤|🍦{2}|🍋/g) -> null. {1} and {1,} seem to work; I assume they are translated into ? and +. I assume that without the "u" flag, 🍦{2} is interpreted as \ud83c\udf66{2}, which would explain the behaviour.
just tested with (?:🍦){2} — this seems to work correctly. I guess that confirms my assumption about the multiplier.
here's a quick fix for that:
// a utility I usually have in my code
var replace = (pattern, replacement) => value => String(value).replace(pattern, replacement);

var fixRegexSource = replace(
    /[\ud800-\udbff][\udc00-\udfff]/g,
    // "(?:$&)" — not sure whether this might still be buggy,
    // so I convert each surrogate pair into explicit \uXXXX escapes,
    // which can't be misinterpreted
    c => `(?:\\u${c.charCodeAt(0).toString(16)}\\u${c.charCodeAt(1).toString(16)})`
);

var fixRegex = regex => new RegExp(
    fixRegexSource(regex.source),
    regex.flags.replace("u", "")
);
sorry, I didn't come up with better function names