In Python or PHP, a simple regex such as /\W/gu matches any non-word character in any script. In JavaScript, however, it matches only [^A-Za-z0-9_]. What are the correct ranges to match the same characters as Python and PHP?
https://regex101.com/r/yhNF8U/1/
Generic solution
Mathias Bynens suggests following the UTS18 recommendation, so a Unicode-aware \W will look like:
[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]
Please note the comment for the suggested Unicode property class combination:
This is only an approximation to Word Boundaries (see b below). The Connector Punctuation is added in for programming language identifiers, thus adding "_" and similar characters.
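As a minimal sketch, assuming a JavaScript engine that supports Unicode property escapes (ES2018 or later), the class above can be used directly with the u flag. The names nonWordUTS18 and sample are illustrative and not part of the original answer.

```javascript
// UTS18-style approximation of a Unicode-aware \W (requires the `u` flag).
const nonWordUTS18 =
  /[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]/gu;

// Hypothetical sample mixing Latin, Cyrillic, and CJK text.
const sample = "h\u00E9llo, мир_42 世界!";

// The built-in \W stays ASCII-based, so it also reports é, м, и, р, 世, 界.
const asciiHits = sample.match(/\W/gu);

// The property-based class only reports the comma, the two spaces, and "!".
const unicodeHits = sample.match(nonWordUTS18);

console.log(asciiHits.length);   // 10
console.log(unicodeHits.length); // 4
```

Only punctuation and whitespace are left in unicodeHits, which is closer to what the question expects from Python or PHP.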
More considerations
The \w construct (and thus its \W counterpart), when matching in a Unicode-aware context, matches a similar but somewhat different set of characters across regex engines.
For example, the .NET definition of the non-word character \W is [^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Mn}\p{Pc}\p{Lm}], where \p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Lm} can be contracted to \p{L}, so the pattern is equal to [^\p{L}\p{Nd}\p{Mn}\p{Pc}].
In Android (see documentation), \W is equivalent to [^\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}\p{IsJoin_Control}], where \p{gc=Mn}\p{gc=Me}\p{gc=Mc} can be written simply as \p{M}.
In PHP PCRE, \W matches [^\p{L}\p{N}_].
The Rexegg cheat sheet defines Python 3 \w as "Unicode letter, ideogram, digit, or underscore", i.e. [\p{L}\p{Mn}\p{Nd}_].
You may roughly decompose \W as [^\p{L}\p{N}\p{M}\p{Pc}]:
/[^\p{L}\p{N}\p{M}\p{Pc}]/gu
where
[^ - is the start of the negated character class that matches a single char other than:
\p{L} - any Unicode letter
\p{N} - any Unicode number
\p{M} - a diacritic mark
\p{Pc} - a connector punctuation symbol
] - end of the character class.
Note that it is the \p{Pc} class that matches an underscore.
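A short sketch of the rough decomposition in use, assuming an ES2018+ engine. The variable names and the sample string are hypothetical. It splits text on runs of non-word characters, and the sample deliberately uses a decomposed "ã" (a plus a combining tilde) to show why \p{M} is part of the class.

```javascript
// Rough Unicode-aware \W, with + added to swallow runs of separators.
const nonWordRun = /[^\p{L}\p{N}\p{M}\p{Pc}]+/u;

// "Sa\u0303o" is "São" written as a + U+0303 COMBINING TILDE (a \p{M} character).
const text = "Sa\u0303o Paulo, Москва, 東京_2024";

// Split on every run of non-word characters.
const words = text.split(nonWordRun);

console.log(words.length); // 4
console.log(words[3]);     // 東京_2024
```

Without \p{M} in the class, the combining tilde would count as a separator and "São" would be split in two.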
NOTE that \p{Alphabetic} (\p{Alpha}) includes all letters matched by \p{L}, plus letter numbers matched by \p{Nl} (e.g. Ⅻ, the character for the Roman numeral 12), plus some other symbols matched by \p{Other_Alphabetic} (\p{OAlpha}).
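The difference can be checked with a small sketch, assuming an ES2018+ engine. The variable names are illustrative only.

```javascript
// U+216B ROMAN NUMERAL TWELVE (Ⅻ) has general category Nl (letter number).
const romanTwelve = "\u216B";

const isLetter = /\p{L}/u.test(romanTwelve);              // false: not in \p{L}
const isLetterNumber = /\p{Nl}/u.test(romanTwelve);       // true
const isAlphabetic = /\p{Alphabetic}/u.test(romanTwelve); // true

console.log(isLetter, isLetterNumber, isAlphabetic); // false true true
```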
Other variations:
/[^\p{L}0-9_]/gu - to just use \W that is aware of Unicode letters only
/[^\p{L}\p{N}_]/gu - (PCRE \W style) to just use \W that is aware of Unicode letters and digits only.
Note that Java's (?U)\W will match a mix of what \W matches in PCRE, Python and .NET.
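To see how these variations diverge, here is a hedged sketch (ES2018+ assumed, all names hypothetical) that tests two characters against each class: an Arabic-Indic digit and a non-ASCII connector punctuation mark. The g flag is left off because test() with g keeps state in lastIndex.

```javascript
const lettersOnly = /[^\p{L}0-9_]/u;         // Unicode letters, ASCII digits, "_"
const pcreStyle = /[^\p{L}\p{N}_]/u;         // Unicode letters and numbers, "_"
const roughFull = /[^\p{L}\p{N}\p{M}\p{Pc}]/u; // adds marks and all connectors

const arabicThree = "\u0663"; // ARABIC-INDIC DIGIT THREE, category Nd
const undertie = "\u203F";    // UNDERTIE, category Pc

// true means "treated as a non-word character" by that class.
const digitResults = [lettersOnly, pcreStyle, roughFull].map((re) => re.test(arabicThree));
const connectorResults = [lettersOnly, pcreStyle, roughFull].map((re) => re.test(undertie));

console.log(digitResults);     // [ true, false, false ]
console.log(connectorResults); // [ true, true, false ]
```

Which class is appropriate depends on which engine's behaviour you are trying to reproduce.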