How can I match a letter from any language using a regex in python 3?
re.match([a-zA-Z])
will match the english language characters but I want all languages to be supported simultaneously.
I don't wish to match the '
in can't
or underscores or any other type of formatting. I do wish my regex to match: c
, a
, n
, t
, Å
, é
, and 中
.
For Unicode regex work in Python, I very strongly recommend the following:
regex
library instead of standard re
, which is not really suitable for Unicode regular expressions..encode
and such, you’re almost certainly doing something wrong.Once you do this, you can safely write patterns that include \w
or \p{script=Latin}
or \p{alpha}
and \p{lower}
etc and know that these will all do what the Unicode Standard says they should. I explain all of this business of Python Unicode regex business in much more detail in this answer. The short story is to always use regex
not re
.
For general Unicode advice, I also have several talks from last OSCON about Unicode regular expressions, most of which apart from the 3rd talk alone is not about Python, but much of which is adaptable.
Finally, there’s always this answer to put the fear of God (or at least, of Unicode) in your heart.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With