How can I use regular expressions with Unicode strings in Python?

Question

Hi, I want to use a regular expression on the following UTF-8 encoded Unicode string:

</td><td>عـــــــــــادي</td><td> 40.00</td>

I want to pick out "عـــــــــــادي". How can I do this?

My code for this is:

state = re.findall(r'td>...</td',s)

Thanks

Stefan van den Akker · Accepted Answer

I ran across something similar when trying to match a string in Russian. For your situation, Michele's answer works fine. If you want to use special sequences like \w and \s, though, you have to change some things. I'm just sharing this, hoping it will be useful to someone else.

>>> import re
>>> string = u"</td><td>Я люблю мороженое</td><td> 40.00</td>"

Make your string Unicode by placing a u before the opening quotation mark.

>>> pattern = re.compile(ur'>([\w\s]+)<', re.UNICODE)

Set the re.UNICODE flag so that \w and \s match Unicode characters as well, not just ASCII ones (see docs).

(Alternatively, you can use your local language to set a range. For Russian this would be [а-яА-Я], so:

pattern = re.compile(ur'>([а-яА-Я\s]+)<')

In that case, you don't have to set a flag anymore, since you're not using a special sequence.)

>>> match = pattern.findall(string)
>>> for i in match:
...     print i
... 
Я люблю мороженое
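The session above is Python 2. As a minimal sketch, assuming Python 3 (where str is already Unicode and the ur prefix is no longer valid syntax), the same two approaches might look like this. The variable names are only illustrative, and adding ё and Ё to the explicit range is my own addition, since those two letters sit outside the contiguous а-я and А-Я ranges.

```python
import re

# Python 3: string literals are Unicode already, so no u prefix is needed.
text = "</td><td>Я люблю мороженое</td><td> 40.00</td>"

# For str patterns in Python 3, \w and \s are Unicode-aware by default,
# so passing re.UNICODE would be redundant.
word_pattern = re.compile(r">([\w\s]+)<")
print(word_pattern.findall(text))

# Explicit-range variant. ё and Ё are listed by hand (an addition to the
# range shown above) because they lie outside а-я and А-Я.
range_pattern = re.compile(r">([а-яА-ЯёЁ\s]+)<")
print(range_pattern.findall(text))
```

Both patterns skip the " 40.00" cell, because the dot in it is neither a word character nor whitespace.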

Michele Spagnuolo · Answer

According to PEP 263 (Defining Python Source Code Encodings), you first need to tell Python that the whole source file is UTF-8 encoded by adding a comment like this on the first or second line:

# -*- coding: utf-8 -*-

Furthermore, try adding 'ur' before the string so that it's raw and Unicode:

state = re.search(ur'td>([^<]+)</td',s)
res = state.group(1)

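I've also edited your regex so that it matches. Three dots mean "exactly three characters", and the text you want is longer than that. On top of this, UTF-8 is a multi-byte encoding, so on a byte string each dot matches a single byte rather than a whole character, and the pattern may not work as expected.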
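For comparison, here is a rough sketch of the same fix, assuming Python 3, where the ur prefix is not accepted and a plain r'' pattern on a str already works on Unicode code points. It uses the string from the question; the names three_dots and word are only illustrative.

```python
import re

s = "</td><td>عـــــــــــادي</td><td> 40.00</td>"

# The original pattern: three dots require exactly three characters
# between "td>" and "</td", so nothing in this string matches.
three_dots = re.findall(r"td>...</td", s)
print(three_dots)

# [^<]+ takes one or more characters up to the next "<", whatever the length.
m = re.search(r"td>([^<]+)</td", s)
word = m.group(1) if m else None  # guard against a failed search
print(word)
```

Checking the result of re.search before calling group(1) avoids an AttributeError when the pattern does not match.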
