
Python regex exclude Underscore

Tags: python, regex

I need to find all two-character sequences of Unicode word characters, excluding the underscore. My current solution is:

pattern = re.compile(ur'(?:\s*)(\w{2})(?:\s*)', re.UNICODE | re.MULTILINE | re.DOTALL)
print pattern.findall('a b c ab cd vs sd a a_ _r')
['ab', 'cd', 'vs', 'sd', 'a_', '_r']

I need to exclude the underscore _ from the regex, so that a_ and _r are not matched. The problem is that my characters can be in any language, so I can't use an ASCII-only class such as [^a-zA-Z]. For example, in Russian:

print pattern.findall(u'ф_')
asked Sep 25 '12 19:09 by artyomboyko


2 Answers

Use a negated character class that excludes both non-word characters (\W) and the underscore:

[^\W_]

instead of

\w
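
As a minimal sketch of how this substitution plays out on the question's sample input, the snippet below assumes Python 3, where str patterns are Unicode-aware by default (so the ur'' prefix and re.UNICODE from the question are not needed). The variable names are only illustrative.

```python
import re

# Python 3: str patterns match Unicode word characters by default.
# [^\W_] reads as "not a non-word character and not an underscore",
# i.e. any letter or digit, in any script.
pattern = re.compile(r'(?:\s*)([^\W_]{2})(?:\s*)')

ascii_pairs = pattern.findall('a b c ab cd vs sd a a_ _r')
print(ascii_pairs)      # ['ab', 'cd', 'vs', 'sd']

cyrillic_pairs = pattern.findall('ф_ фы')
print(cyrillic_pairs)   # ['фы']
```

The double negation is what keeps this language-independent: \W already knows what a non-word character is for every script, so only the underscore has to be listed by hand.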
answered Oct 12 '22 05:10 by Ioan Alexandru Cucu


Your best bet would be to use the new regex module instead. One of its features is that it can remove characters from a character set:

import regex as re

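pattern = re.compile(ur'(?:\s*)([\w--_]{2})(?:\s*)', re.UNICODE | re.MULTILINE | re.DOTALL | re.VERSION1)

The [\w--_] syntax creates a character set that is the same as \w with the underscore character removed from the matching characters. Set operations such as -- are only enabled under the module's version 1 behaviour, hence the re.VERSION1 flag.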
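
A hedged Python 3 sketch of the set-difference approach follows. It assumes the third-party regex package is installed and that set operations require version 1 behaviour (the regex.VERSION1 flag); find_pairs is a hypothetical helper name, and the import is guarded so the snippet does nothing where the package is missing.

```python
try:
    import regex  # third-party package, not part of the standard library
except ImportError:
    regex = None  # lets the snippet load even where the package is absent


def find_pairs(text):
    """Return two-character runs of word characters, underscore excluded."""
    # [\w--_] is a set difference: everything in \w minus "_".
    # VERSION1 switches on nested sets and set operations.
    pattern = regex.compile(r'[\w--_]{2}', regex.VERSION1)
    return pattern.findall(text)


if regex is not None:
    print(find_pairs('a b c ab cd vs sd a a_ _r'))
    print(find_pairs('ф_ фы'))
```

The same flag can also be set inline by starting the pattern with (?V1), which may be handier when the pattern is passed around as a plain string.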

answered Oct 12 '22 03:10 by Martijn Pieters