Hi, I want to use a regular expression on a Unicode (UTF-8) string like the following:
</td><td>عـــــــــــادي</td><td> 40.00</td>
I want to pick "عـــــــــــادي"
out. How can I do this?
My code for this is:
state = re.findall(r'td>...</td',s)
Thanks
I ran across something similar when trying to match a string in Russian. For your situation, Michele's answer works fine. If you want to use special sequences like \w and \s, though, you have to change some things. I'm just sharing this, hoping it will be useful to someone else.
>>> string = u"</td><td>Я люблю мороженое</td><td> 40.00</td>"
Make your string Unicode by placing a u before the quotation marks.
>>> pattern = re.compile(ur'>([\w\s]+)<', re.UNICODE)
Set the re.UNICODE flag so that \w and \s also match Unicode characters (see the docs).
(Alternatively, you can set a character range for your own language. For Russian this would be [а-яА-Я], so:
pattern = re.compile(ur'>([а-яА-Я\s]+)<')
In that case, you don't have to set a flag anymore, since you're not using a special sequence.)
>>> match = pattern.findall(string)
>>> for i in match:
...     print i
...
Я люблю мороженое
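The session above is Python 2 (note the ur prefix and the print statement). As a minimal sketch of the same idea in Python 3, assuming that is the interpreter in use: str patterns are Unicode-aware by default, so \w and \s work without re.UNICODE, and the ur prefix is no longer valid syntax. The variable names here are only illustrative. The second pattern also lists ё and Ё explicitly, since those two letters fall outside the а-я and А-Я ranges.

```python
import re

# Python 3: every str literal is already Unicode, so no u prefix is needed.
text = "</td><td>Я люблю мороженое</td><td> 40.00</td>"

# \w and \s match Unicode letters and whitespace by default for str patterns.
word_pattern = re.compile(r">([\w\s]+)<")
matches = word_pattern.findall(text)

# Explicit-range variant. ё/Ё are added by hand because they are not
# inside the а-я / А-Я code point ranges.
range_pattern = re.compile(r">([а-яА-ЯёЁ\s]+)<")
range_matches = range_pattern.findall(text)

for m in matches:
    print(m)
```

The price cell is not captured by either pattern, because the dot in 40.00 is neither a word character nor whitespace.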
According to PEP 263 (Defining Python Source Code Encodings), you first need to tell Python that the whole source file is UTF-8 encoded by adding a comment like this on the first or second line:
# -*- coding: utf-8 -*-
Furthermore, try adding 'ur' before the string so that it's raw and Unicode:
state = re.search(ur'td>([^<]+)</td',s)
res = state.group(1)
I've also edited your regex to make it match. Three dots mean "exactly three characters", but since you are using UTF-8, which is a multi-byte encoding, this may not work as expected: on a byte string each dot matches a single byte, not a whole character.
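For comparison, here is a rough Python 3 sketch of the same fix, assuming the string from the question is already a decoded str (the names html, cell, and word are just for illustration). It also shows why the three-dot pattern finds nothing and how the character count differs from the UTF-8 byte count.

```python
import re

# The string from the question, as a Python 3 str (already Unicode).
html = "</td><td>عـــــــــــادي</td><td> 40.00</td>"

# [^<]+ captures everything up to the next tag, whatever its length.
cell = re.search(r"td>([^<]+)</td", html)
word = cell.group(1) if cell else None

# The original pattern requires exactly three characters between the tags,
# and no cell in this string has exactly three, so nothing matches.
three_dots = re.findall(r"td>...</td", html)

print(word)
print(three_dots)
# Characters versus UTF-8 bytes: each Arabic letter here takes two bytes.
print(len(word), len(word.encode("utf-8")))
```

Because [^<]+ only excludes the < character, it does not depend on how Python classifies any particular Arabic character.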