
How can I use a regular expression on a Unicode string in Python?

Hi, I want to use a regular expression on the following UTF-8 encoded Unicode string:

</td><td>عـــــــــــادي</td><td> 40.00</td>

I want to extract "عـــــــــــادي". How can I do this?

My code for this is:

state = re.findall(r'td>...</td',s)

Thanks

Mahdi asked Feb 25 '12 17:02



2 Answers

I ran across something similar when trying to match a string in Russian. For your situation, Michele's answer works fine. If you want to use special sequences like \w and \s, though, you have to change some things. I'm just sharing this, hoping it will be useful to someone else.

>>> string = u"</td><td>Я люблю мороженое</td><td> 40.00</td>"

Make the string Unicode by placing a u before the opening quotation mark. (This is needed in Python 2; in Python 3, string literals are already Unicode.)

>>> pattern = re.compile(ur'>([\w\s]+)<', re.UNICODE)

Pass the re.UNICODE flag so that \w and \s match Unicode letters and whitespace rather than only their ASCII counterparts (see the re docs).

(Alternatively, you can spell out a character range for your language. For Russian this would be [а-яА-Я], so:

pattern = re.compile(ur'>([а-яА-Я\s]+)<')

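In that case you no longer need the flag, since the range lists the letters explicitly and \s only has to match a plain ASCII space here. Note that [а-яА-Я] does not include ё or Ё, which lie outside that range.)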

>>> match = pattern.findall(string)
>>> for i in match:
...     print i
... 
Я люблю мороженое
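
For readers on Python 3, here is a minimal sketch of the same idea. It assumes Python 3, where str literals are already Unicode and \w and \s are Unicode-aware by default for str patterns, so neither the u prefix nor re.UNICODE is needed. The variable names are illustrative only, and the range variant adds ё and Ё by hand as an assumption about what a Russian-only class should cover.

```python
import re

# Python 3: str literals are Unicode, so no u prefix is needed.
text = "</td><td>Я люблю мороженое</td><td> 40.00</td>"

# For str patterns, \w and \s already match Unicode letters and whitespace.
unicode_pattern = re.compile(r">([\w\s]+)<")
unicode_matches = unicode_pattern.findall(text)

# re.ASCII restricts \w and \s to ASCII, which roughly mirrors what
# Python 2 does when re.UNICODE is left out.
ascii_pattern = re.compile(r">([\w\s]+)<", re.ASCII)
ascii_matches = ascii_pattern.findall(text)

# Explicit range variant; ё and Ё are added because they fall outside а-я / А-Я.
range_pattern = re.compile(r">([а-яА-ЯёЁ\s]+)<")
range_matches = range_pattern.findall(text)

for found in unicode_matches:
    print(found)
```

With this input, both the Unicode-aware pattern and the explicit range find the Russian phrase, while the re.ASCII variant finds nothing, because Cyrillic letters are not ASCII word characters.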
Stefan van den Akker answered Sep 28 '22 09:09



According to PEP 0263 (Defining Python Source Code Encodings), you first need to tell Python that the whole source file is UTF-8 encoded by adding a comment like this on the first or second line:

# -*- coding: utf-8 -*-

Furthermore, try prefixing the pattern with ur so that it is both raw and Unicode (this prefix is Python 2 syntax):

state = re.search(ur'td>([^<]+)</td',s)
res = state.group(1)

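I've also changed your regex so that it matches. Three dots match exactly three characters, which is shorter than the word you want. In addition, if s is a UTF-8 byte string, each dot matches a single byte rather than a whole character, because UTF-8 is a multi-byte encoding. The negated class [^<]+ avoids both problems by taking everything up to the next "<".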
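
To make the difference between characters and bytes concrete, here is a rough sketch that assumes Python 3 (so the ur prefix is dropped and a plain raw string is used). It runs the corrected pattern on the string from the question and compares it with the original three-dot pattern. All variable names are illustrative, not part of the original answer.

```python
import re

# The string from the question, as a Python 3 (Unicode) str.
s = "</td><td>عـــــــــــادي</td><td> 40.00</td>"

# [^<]+ captures everything up to the next "<", whatever its length or script.
match = re.search(r"td>([^<]+)</td", s)
word = match.group(1) if match else None

# The same text has more UTF-8 bytes than characters,
# which is why counting with dots on a byte string goes wrong.
char_count = len(word)
byte_count = len(word.encode("utf-8"))

# The original three-dot pattern only fits a cell of exactly three characters.
three_dot = re.findall(r"td>...</td", s)

print(word, char_count, byte_count, three_dot)
```

Here the three-dot pattern returns an empty list, since neither table cell holds exactly three characters, while the negated class captures the whole Arabic word.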

Michele Spagnuolo answered Sep 28 '22 08:09
