Perl and some other current regex engines support Unicode properties, such as the category, in a regex. E.g. in Perl you can use \p{Ll}
to match an arbitrary lower-case letter, or p{Zs}
for any space separator. I don't see support for this in either the 2.x nor 3.x lines of Python (with due regrets). Is anybody aware of a good strategy to get a similar effect? Homegrown solutions are welcome.
This will make your regular expressions work with all Unicode regex engines. In addition to the standard notation, \p{L}, Java, Perl, PCRE, the JGsoft engine, and XRegExp 3 allow you to use the shorthand \pL. The shorthand only works with single-letter Unicode properties.
\u000d — Carriage return — \r. \u2028 — Line separator. \u2029 — Paragraph separator.
To match a character having special meaning in regex, you need to use a escape sequence prefix with a backslash ( \ ). E.g., \. matches "." ; regex \+ matches "+" ; and regex \( matches "(" . You also need to use regex \\ to match "\" (back-slash).
The regex module (an alternative to the standard re
module) supports Unicode codepoint properties with the \p{}
syntax.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With