Hi, I want to use a regular expression on a Unicode (UTF-8) string like the following:
</td><td>عـــــــــــادي</td><td> 40.00</td>
I want to pick "عـــــــــــادي" out. How can I do this?
My code for this is:
state = re.findall(r'td>...</td',s)
Thanks
I ran across something similar when trying to match a string in Russian. For your situation, Michele's answer works fine. If you want to use special sequences like \w and \s, though, you have to change some things. I'm just sharing this, hoping it will be useful to someone else.
>>> string = u"</td><td>Я люблю мороженое</td><td> 40.00</td>"
Make your string Unicode by placing a u before the opening quotation mark.
>>> pattern = re.compile(ur'>([\w\s]+)<', re.UNICODE)
Pass the re.UNICODE flag so that \w and \s match Unicode characters as well (see docs).
(Alternatively, you can set a character range for your own language. For Russian this would be [а-яА-Я] (add ёЁ if you need those letters, since they fall outside that range), so:
pattern = re.compile(ur'>([а-яА-Я\s]+)<')
In that case, you don't have to set a flag anymore, since you're not using a special sequence.)
>>> match = pattern.findall(string)
>>> for i in match:
... print i
...
Я люблю мороженое
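For anyone on Python 3, here is a minimal sketch of the same two approaches. It assumes Python 3, where every str is already Unicode and \w and \s are Unicode-aware by default, so neither the u prefix nor re.UNICODE is needed. The variable names (html, by_class, by_range, missed, fixed) are just for illustration.

```python
import re

html = "</td><td>Я люблю мороженое</td><td> 40.00</td>"

# Assumption: Python 3, where str patterns are Unicode-aware by default,
# so \w and \s match Cyrillic letters and spaces without any flag.
by_class = re.findall(r">([\w\s]+)<", html)

# Same idea with an explicit Cyrillic range instead of \w.
by_range = re.findall(r">([а-яА-Я\s]+)<", html)

# Caveat: ё (U+0451) and Ё (U+0401) sit outside а-я / А-Я,
# so a word starting with ё is not matched by the plain range.
missed = re.findall(r">([а-яА-Я\s]+)<", "<td>ёлка</td>")
fixed = re.findall(r">([а-яА-ЯёЁ\s]+)<", "<td>ёлка</td>")

print(by_class)
print(by_range)
print(missed)
print(fixed)
```

Both patterns skip the " 40.00" cell because the dot is neither a word character nor whitespace.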
According to PEP 0263: Defining Python Source Code Encodings, you first need to tell Python that the whole source file is UTF-8 encoded by adding a comment like this on the first or second line:
# -*- coding: utf-8 -*-
Furthermore, try adding 'ur' before the pattern string so that it's both raw and Unicode (this prefix is Python 2 only; in Python 3 a plain r'...' string is already Unicode):
state = re.search(ur'td>([^<]+)</td',s)
res = state.group(1)
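I've also edited your regex so that it matches. Three dots mean "exactly three characters", which is too short for this word. On top of that, if s is a UTF-8 byte string rather than a Unicode string, each dot matches a single byte, and since UTF-8 is a multi-byte encoding, the count will be off even further.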
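To make the difference concrete, here is a small sketch under the assumption that you are on Python 3 (so s is a Unicode str and no ur prefix is needed). The names s, three_dots, m, word and raw are only illustrative.

```python
import re

s = "</td><td>عـــــــــــادي</td><td> 40.00</td>"

# The original pattern: exactly three characters between the tags.
# Neither cell is three characters long, so nothing is found.
three_dots = re.findall(r"td>...</td", s)

# The edited pattern: one or more characters that are not '<'.
m = re.search(r"td>([^<]+)</td", s)
word = m.group(1) if m else None

# The same text as UTF-8 bytes is longer than its character count,
# which is why counting with dots on a byte string goes wrong.
raw = word.encode("utf-8")

print(three_dots)
print(word)
print(len(word), len(raw))
```

re.search returns only the first match; re.findall with the same pattern would also return the " 40.00" cell.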