I'm aware that Python 3 fixes a lot of UTF issues, I am not however able to use Python 3, I am using 2.5.1
I'm trying to regex a document but the document has UTF hyphens in it – rather than -. Python can't match these and if I put them in the regex it throws a wobbly.
How can I force Python to use a UTF string or in some way match a character such as that?
Thanks for your help
UTF-8 is one of the most commonly used encodings, and Python often defaults to using it. UTF stands for “Unicode Transformation Format”, and the '8' means that 8-bit values are used in the encoding.
Use bytes.decode(encoding) with encoding as "utf8" to decode a UTF-8-encoded byte string bytes .
Python, Java, and Perl all support regex functionality, as do most Unix tools and many text editors.
You have to escape the character in question (–) and put a u in front of the string literal to make it a unicode string.
So, for example, this:
re.compile("–")
becomes this:
re.compile(u"\u2013")
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With