What's the correct regex range for javascript's regexes to match all the non word characters in any script?

Tags:

javascript

regex

In Python or PHP, a simple regex such as /\W/gu matches any non-word character in any script. In JavaScript, however, it only matches [^A-Za-z0-9_]. What are the correct ranges to match the same characters as Python and PHP?

https://regex101.com/r/yhNF8U/1/

787

asked Jul 07 '20 09:07

DannyM

1 Answer

Generic solution

Mathias Bynens suggests following the UTS18 recommendation, so a Unicode-aware \W looks like this:

[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]

Please note the comment for the suggested Unicode property class combination:

This is only an approximation to Word Boundaries (see b below). The Connector Punctuation is added in for programming language identifiers, thus adding "_" and similar characters.
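
As a minimal sketch of how the UTS18-style class above behaves in JavaScript next to the built-in \W: the sample string and the variable names are made up for illustration, and property escapes such as \p{Alphabetic} need the u flag to be interpreted.

```javascript
// UTS18-style replacement for \W, copied from the class above.
const NON_WORD_UTS18 =
  /[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]/gu;

// Hypothetical sample: Latin with a precomposed diacritic, an underscore, Cyrillic.
const sample = "na\u00EFve_слово!";

// Built-in \W only knows [A-Za-z0-9_], so every non-ASCII letter is removed.
const asciiOnly = sample.replace(/\W/gu, "");

// The UTS18-style class keeps letters from any script and drops only "!".
const unicodeAware = sample.replace(NON_WORD_UTS18, "");

console.log(asciiOnly);    // nave_
console.log(unicodeAware); // naïve_слово
```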

More considerations

The \w construct (and thus its \W counterpart), when matching in a Unicode-aware context, matches a similar but slightly different set of characters in each regex engine.

For example, the .NET documentation ("Non-word character: \W") defines \W as [^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Mn}\p{Pc}\p{Lm}]. Here \p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Lm} together cover every letter subcategory and can be contracted to \p{L}, so the pattern is equal to [^\p{L}\p{Nd}\p{Mn}\p{Pc}].

In Android (see documentation), \W matches [^\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}\p{IsJoin_Control}], where \p{gc=Mn}\p{gc=Me}\p{gc=Mc} can be written simply as \p{M}.

In PHP PCRE, \W matches [^\p{L}\p{N}_].

The Rexegg cheat sheet defines Python 3 \w as "Unicode letter, ideogram, digit, or underscore", i.e. roughly [\p{L}\p{Mn}\p{Nd}_].
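
To see where these definitions diverge, here is a small sketch that transcribes three of the classes quoted above into JavaScript and probes them with a few characters. The regexes mirror the text of this answer, not the engines themselves, so treat the results as a comparison of the quoted definitions only. The object and function names are made up for illustration.

```javascript
// JavaScript transcriptions of the per-engine \W definitions quoted above.
const nonWordByEngine = {
  dotnet: /[^\p{L}\p{Nd}\p{Mn}\p{Pc}]/u,
  pcre: /[^\p{L}\p{N}_]/u,
  python: /[^\p{L}\p{Mn}\p{Nd}_]/u, // per the Rexegg description
};

// True when the quoted definition for `engine` treats `ch` as a non-word character.
function isNonWord(engine, ch) {
  return nonWordByEngine[engine].test(ch);
}

const probes = [
  ["superscript two (No)", "\u00B2"],
  ["combining acute accent (Mn)", "\u0301"],
  ["undertie (Pc)", "\u203F"],
  ["underscore (Pc)", "_"],
];

for (const [label, ch] of probes) {
  const row = Object.keys(nonWordByEngine)
    .map((engine) => `${engine}: ${isNonWord(engine, ch) ? "non-word" : "word"}`)
    .join(", ");
  console.log(`${label} -> ${row}`);
}
```

Only the underscore is classified the same way by all three transcriptions. The other probes each fall on a different side of at least one definition.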

You may roughly decompose \W as [^\p{L}\p{N}\p{M}\p{Pc}]:

/[^\p{L}\p{N}\p{M}\p{Pc}]/gu

where

[^ - starts a negated character class that matches any single character other than:
- \p{L} - any Unicode letter
- \p{N} - any Unicode number (decimal digits, letter numbers, and other numeric characters)
- \p{M} - any combining mark (diacritic)
- \p{Pc} - any connector punctuation character
] - ends the character class.

Note that it is the \p{Pc} class that matches the underscore.
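
One way to put this rough decomposition to work is splitting text into words regardless of script. The sketch below assumes a + quantifier (not part of the pattern above) so that runs of separators collapse into one split point. The function name and sample strings are hypothetical.

```javascript
// Rough Unicode-aware \W from above, with "+" added to treat runs of
// non-word characters as a single separator.
const NON_WORD_RUN = /[^\p{L}\p{N}\p{M}\p{Pc}]+/gu;

// Split text into word-like tokens; empty strings from leading or
// trailing separators are filtered out.
function words(text) {
  return text.split(NON_WORD_RUN).filter(Boolean);
}

// The underscore stays inside "user_name" because \p{Pc} covers it.
console.log(words("user_name = \u03B1\u03B2\u03B3, \u65E5\u672C\u8A9E!"));

// A decomposed "é" (e + U+0301) stays in one token because \p{M} covers the mark.
console.log(words("cafe\u0301 au lait"));
```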

Note that \p{Alphabetic} (\p{Alpha}) includes all letters matched by \p{L}, plus letter numbers matched by \p{Nl} (e.g. Ⅻ, the Roman numeral twelve), plus some other characters matched by \p{Other_Alphabetic} (\p{OAlpha}).
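
The difference is easy to check in JavaScript with the Roman numeral mentioned above. This is only a sketch, and the variable names are illustrative.

```javascript
const romanTwelve = "\u216B"; // Ⅻ, general category Nl (letter number)

const isLetter = /\p{L}/u.test(romanTwelve);
const isLetterNumber = /\p{Nl}/u.test(romanTwelve);
const isAlphabetic = /\p{Alphabetic}/u.test(romanTwelve);

console.log(isLetter);       // false
console.log(isLetterNumber); // true
console.log(isAlphabetic);   // true

// Consequence for the classes discussed here: the \p{N}-based class treats Ⅻ
// as a word character, while a class restricted to \p{Nd} does not.
const nonWordWithN = /[^\p{L}\p{N}\p{M}\p{Pc}]/u.test(romanTwelve);
const nonWordWithNd = /[^\p{L}\p{Nd}\p{Mn}\p{Pc}]/u.test(romanTwelve);

console.log(nonWordWithN);  // false
console.log(nonWordWithNd); // true
```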

Other variations:

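/[^\p{L}0-9_]/gu - a \W that is aware of Unicode letters only (digits stay ASCII-only)
/[^\p{L}\p{N}_]/gu - (PCRE \W style) a \W that is aware of Unicode letters and numbers only (no combining marks, and the underscore is the only connector punctuation kept as a word character).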
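
A short sketch of how the two variations differ on non-ASCII digits. The sample string, which mixes Cyrillic letters, Devanagari digits and ASCII digits, is made up for illustration.

```javascript
const lettersOnly = /[^\p{L}0-9_]/gu;
const lettersAndNumbers = /[^\p{L}\p{N}_]/gu;

// "сумма_" + DEVANAGARI DIGIT THREE + DEVANAGARI DIGIT FOUR + " = 34"
const mixedDigits = "сумма_\u0969\u096A = 34";

// Only ASCII digits count as word characters, so the Devanagari digits are removed.
const keptAsciiDigits = mixedDigits.replace(lettersOnly, "");

// \p{N} keeps digits from any script.
const keptAllDigits = mixedDigits.replace(lettersAndNumbers, "");

console.log(keptAsciiDigits); // сумма_34
console.log(keptAllDigits);   // сумма_३४34
```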

Note that Java's (?U)\W will match a mix of what \W matches in PCRE, Python and .NET.

128

answered Oct 07 '22 15:10

Wiktor Stribiżew
