
UTF in Python Regex

Tags: python, regex

I'm aware that Python 3 fixes a lot of UTF issues. However, I'm not able to use Python 3; I'm using 2.5.1.

I'm trying to run a regex over a document, but the document contains Unicode en dashes (–) rather than plain hyphens (-). Python can't match these, and if I put them in the regex it throws an error.

How can I force Python to use a Unicode string, or otherwise match a character like that?

Thanks for your help

Teifion asked Dec 16 '08 17:12




1 Answer

You have to escape the character in question (–) and put a u in front of the string literal to make it a Unicode string.

So, for example, this:

re.compile("–") 

becomes this:

re.compile(u"\u2013")
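For a slightly fuller picture, here is a minimal sketch of how this might fit together when the text arrives as UTF-8 bytes. The sample text and the names raw, text and dash_range are made up for illustration, and I'm assuming the document is UTF-8 encoded. The idea is to decode the bytes to a Unicode string first, then match with a Unicode pattern that accepts either the en dash or a plain hyphen:

```python
import re

# Hypothetical sample data: UTF-8 bytes, as you might get from reading a file.
# Built via encode() so the same line works on both Python 2 and Python 3.
raw = u"pages 10\u201320 and 30-40".encode("utf-8")

# Decode the bytes to a Unicode string before matching
# (assumes the document really is UTF-8).
text = raw.decode("utf-8")

# Unicode pattern: digits, then an en dash (U+2013) or a plain hyphen, then digits.
# The hyphen sits last in the character class, so it is taken literally.
dash_range = re.compile(u"(\\d+)[\u2013-](\\d+)")

matches = dash_range.findall(text)
print(matches)
```

Both the pattern and the text being searched need to be Unicode strings; mixing a Unicode pattern with an undecoded byte string is the usual way this goes wrong on Python 2.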
Patrick McElhaney answered Oct 17 '22 08:10
