
How to match accented characters with a regex in Python?

I need the solutions to this question, except for Python! I've tried installing the regex library for Python, as apparently that enables the use of POSIX expressions in Python's regexes, but nevertheless I guess it does not include Unicode characters in the [:alpha:] class. E.g.:

>>> re.search(r'[[:alpha:] ]+','Please work blåbær and NOW stop 123').group(0)
'Please work bl'

When I want it to match Please work blåbær and NOW stop

EDIT: I am using Python 2.7

EDIT 2: I tried the following:

>>> re.search(re.compile('[\w ]+', re.UNICODE),'Please work blåbær and NOW stop 123').group(0)
'Please work bl\xc3'

Not quite what I wanted (I want to match the part after the first non-ASCII character too), but at least it matched one character more than before. What should I be doing here to get it to match the rest of what I want?

EDIT 3: I don't want to match any non-"word" characters; by "word" I mean a-z, A-Z, space, and any accented variations of word characters. I hope I got my idea across; in a phrase like

lets match força, but stop before that comma

I want to match only lets match força

EDIT 4: So I tried to use Python 3 just for this one script:

>>> re.search(re.compile('[\w ]+', re.UNICODE),'lets match força, but stop before that comma').group(0)
'lets match força'

I guess it works for the most part in Python 3, except that it also matches numbers (which I definitely don't want) and underscores. Any way to fix this, in Python 2 or 3?

wrongusername asked Nov 07 '12 01:11



1 Answer

It's not clear which Python version you are using. If you use 2.x, you may have a Unicode issue. See this post for further pointers, and feel free to update your question to elaborate further.
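As a rough illustration of what such a Unicode issue can look like, here is a small sketch. It is written for Python 3 and only simulates the Python 2 behaviour; it assumes the original string was UTF-8 encoded bytes and that the pattern treated each byte as a separate character. The variable names are made up for the example.

```python
import re

# Escapes are used for å (U+00E5) and æ (U+00E6) so the example does not
# depend on the source file's encoding.
text = 'Please work bl\u00e5b\u00e6r and NOW stop 123'

# A plain '...' literal in Python 2 holds encoded bytes, not characters.
# Assumption: the terminal used UTF-8, so å became the two bytes C3 A5.
raw = text.encode('utf-8')

# Simulate matching those bytes one at a time as if each byte were a
# character (decoding as Latin-1 maps byte N to code point N).
as_code_points = raw.decode('latin-1')
partial = re.search(r'[\w ]+', as_code_points).group(0)

# Decoding with the right codec first yields real characters.
full = re.search(r'[\w ]+', raw.decode('utf-8')).group(0)

print(ascii(partial))
print(ascii(full))
```

The first result stops right after the byte C3, like the 'Please work bl\xc3' output in the question, because the second byte of å (A5) is not a word character when read on its own. The second result runs through the accented letters, although it still includes the digits because \w matches them.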

I'm quite surprised that I can't convert the accented character to a proper Unicode representation...

But there are workarounds:

re.search(re.compile('((\w+\s)|(\w+\W+\w+\s))+', re.UNICODE), ur'Please work blåbær and NOW stop 123').group(0)

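or, if it is enough to stop at the first digit (note that \D also matches punctuation such as commas):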

re.search(re.compile('\D+', re.UNICODE), ur'Please work blåbær and NOW stop 123').group(0)
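
If digits, underscores and punctuation all need to be excluded, one possible approach (a sketch for Python 3, where str patterns are Unicode-aware by default) is the class [^\W\d_], which reads as "a word character that is neither a digit nor an underscore". The helper name leading_words is hypothetical and only used for this example.

```python
import re

# [^\W\d_] matches any Unicode letter: a word character (\w) that is
# not a digit (\d) and not an underscore. A literal space is allowed
# as an alternative so that whole phrases match.
LETTERS_AND_SPACES = re.compile(r'(?:[^\W\d_]| )+')


def leading_words(text):
    """Return the first run of letters and spaces in text, or ''."""
    match = LETTERS_AND_SPACES.search(text)
    if match is None:
        return ''
    # The run may end with the space that precedes a digit or symbol.
    return match.group(0).rstrip(' ')


print(leading_words('lets match força, but stop before that comma'))
print(leading_words('Please work blåbær and NOW stop 123'))
```

With these inputs the helper should return the text up to, but not including, the comma and the digits. In Python 2 the same pattern would presumably need a unicode pattern and subject (u'...') together with re.UNICODE. Accents stored as separate combining marks are not letters for \w, so such input may need unicodedata.normalize('NFC', text) first.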
Don Question answered Nov 15 '22 05:11
