How can I match an alpha character with a regular expression? I want a character that is in \w but is not in \d. I want it to be Unicode compatible, which is why I cannot use [a-zA-Z].
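Common regex syntax: . matches any single character except newline (in Python, the re.DOTALL flag allows it to match newline as well). [...] matches any single character in brackets. * matches 0 or more occurrences of the preceding expression. + matches 1 or more occurrences of the preceding expression.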
The match() function of Python's re module checks for a match only at the beginning of the string. If the pattern matches there, it returns a match object; otherwise it returns None. To find the first occurrence anywhere in the string, use re.search() instead.
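A minimal sketch of that difference, assuming Python 3; the sample text and variable names are made up for illustration:

```python
import re

text = "abc def"

# re.match only tries the pattern at position 0 of the string.
at_start = re.match(r"def", text)

# re.search scans forward and returns the first occurrence anywhere.
anywhere = re.search(r"def", text)

print(at_start)         # None
print(anywhere.span())  # (4, 7)
```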
The Match-zero-or-more Operator (*): This operator repeats the smallest possible preceding regular expression as many times as necessary (including zero) to match the pattern. "*" represents this operator. For example, "o*" matches any string made up of zero or more "o"s.
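The regex \w is equivalent to [A-Za-z0-9_] only in ASCII mode, where it matches ASCII alphanumeric characters and underscore. With Unicode matching (re.UNICODE in Python 2, the default for str patterns in Python 3) it also matches letters and digits from other scripts.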
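As a quick illustration of how the meaning of \w depends on the matching mode, here is a small sketch assuming Python 3, where str patterns are Unicode-aware by default and re.ASCII restricts \w to [A-Za-z0-9_]. The sample string is invented for the example:

```python
import re

# "café_1 да" written with escapes so the code points are unambiguous.
sample = "caf\u00e9_1 \u0434\u0430"

# Default (Unicode) matching: accented and Cyrillic letters count as word characters.
unicode_words = re.findall(r"\w+", sample)

# ASCII matching: \w is limited to [A-Za-z0-9_].
ascii_words = re.findall(r"\w+", sample, flags=re.ASCII)

print(unicode_words)  # ['café_1', 'да']
print(ascii_words)    # ['caf', '_1']
```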
Your first two sentences contradict each other. "in \w but is not in \d" includes underscore. I'm assuming from your third sentence that you don't want underscore.
Using a Venn diagram on the back of an envelope helps. Let's look at what we DON'T want:
(1) characters that are not matched by \w (i.e. don't want anything that's not alpha, digits, or underscore) => \W
(2) digits => \d
(3) underscore => _
So what we don't want is anything in the character class [\W\d_]
and consequently what we do want is anything in the character class [^\W\d_]
Here's a simple example (Python 2.6).
    >>> import re
    >>> rx = re.compile("[^\W\d_]+", re.UNICODE)
    >>> rx.findall(u"abc_def,k9")
    [u'abc', u'def', u'k']
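For readers on Python 3, a rough equivalent might look like the sketch below. It assumes Python 3, where str patterns are Unicode-aware without re.UNICODE, and it adds a brute-force check (not part of the original answer) that the class [^\W\d_] agrees with "in \w, not in \d, not underscore" over a sample range of code points. The names LETTERS, matches and mismatches are illustrative only:

```python
import re

# Python 3: a raw string avoids escape warnings; Unicode matching is the default.
LETTERS = re.compile(r"[^\W\d_]+")
print(LETTERS.findall("abc_def,k9"))  # ['abc', 'def', 'k']

WORD = re.compile(r"\w")
DIGIT = re.compile(r"\d")
LETTER = re.compile(r"[^\W\d_]")


def matches(pattern, ch):
    """Return True if the single character ch is matched by pattern."""
    return pattern.fullmatch(ch) is not None


# Compare the negated class against the "Venn diagram" definition,
# one code point at a time, for the first 0x3000 code points.
mismatches = [
    cp
    for cp in range(0x3000)
    if matches(LETTER, chr(cp))
    != (matches(WORD, chr(cp)) and not matches(DIGIT, chr(cp)) and chr(cp) != "_")
]
print(mismatches)  # []
```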
Further exploration reveals a few quirks of this approach:
    >>> import unicodedata as ucd
    >>> allsorts = u"\u0473\u0660\u06c9\u24e8\u4e0a\u3020\u3021"
    >>> for x in allsorts:
    ...     print repr(x), ucd.category(x), ucd.name(x)
    ...
    u'\u0473' Ll CYRILLIC SMALL LETTER FITA
    u'\u0660' Nd ARABIC-INDIC DIGIT ZERO
    u'\u06c9' Lo ARABIC LETTER KIRGHIZ YU
    u'\u24e8' So CIRCLED LATIN SMALL LETTER Y
    u'\u4e0a' Lo CJK UNIFIED IDEOGRAPH-4E0A
    u'\u3020' So POSTAL MARK FACE
    u'\u3021' Nl HANGZHOU NUMERAL ONE
    >>> rx.findall(allsorts)
    [u'\u0473', u'\u06c9', u'\u4e0a', u'\u3021']
U+3021 (HANGZHOU NUMERAL ONE) is treated as numeric (hence it matches \w), but it appears that Python interprets "digit" to mean "decimal digit" (category Nd), so it doesn't match \d.
U+24E8 (CIRCLED LATIN SMALL LETTER Y) doesn't match \w.
All CJK ideographs are classed as "letters" and thus match \w.
Whether any of the above 3 points are a concern or not, that approach is the best you will get out of the re module as currently released. Syntax like \p{letter} is in the future.
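If the first quirk is a concern, one alternative outside the re module is to test each character's Unicode general category directly. The sketch below assumes Python 3 and treats "letter" as "general category starting with L"; the helper names is_letter and letter_runs are hypothetical, not part of any library:

```python
import unicodedata
from itertools import groupby


def is_letter(ch):
    """True if ch has a Unicode general category of Lu, Ll, Lt, Lm or Lo."""
    return unicodedata.category(ch).startswith("L")


def letter_runs(text):
    """Return the maximal runs of consecutive letters in text."""
    return [
        "".join(group)
        for is_run, group in groupby(text, key=is_letter)
        if is_run
    ]


allsorts = "\u0473\u0660\u06c9\u24e8\u4e0a\u3020\u3021"

print(letter_runs("abc_def,k9"))  # ['abc', 'def', 'k']

# U+3021 (category Nl) is excluded here, unlike with [^\W\d_].
print([ascii(run) for run in letter_runs(allsorts)])
```

The trade-off is that this is a per-character scan rather than a pattern, so it cannot be combined with other regex constructs.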
What about:
\p{L}
You can use this document as a reference: Unicode Regular Expressions
EDIT: It seems Python's re module doesn't support Unicode property escapes such as \p{L}. Take a look at this link: Handling Accented Characters with Python Regular Expressions -- [A-Z] just isn't good enough (no longer active; the link points to the Internet Archive)
For posterity, here are the examples on the blog:
    import re
    string = 'riché'
    print string
    # riché
    richre = re.compile('([A-z]+)')
    match = richre.match(string)
    print match.groups()
    # ('rich',)
    richre = re.compile('(\w+)',re.LOCALE)
    match = richre.match(string)
    print match.groups()
    # ('rich',)
    richre = re.compile('([é\w]+)')
    match = richre.match(string)
    print match.groups()
    # ('rich\xe9',)
    richre = re.compile('([\xe9\w]+)')
    match = richre.match(string)
    print match.groups()
    # ('rich\xe9',)
    richre = re.compile('([\xe9-\xf8\w]+)')
    match = richre.match(string)
    print match.groups()
    # ('rich\xe9',)
    string = 'richéñ'
    match = richre.match(string)
    print match.groups()
    # ('rich\xe9\xf1',)
    richre = re.compile('([\u00E9-\u00F8\w]+)')
    print match.groups()
    # ('rich\xe9\xf1',)
    matched = match.group(1)
    print matched
    # richéñ
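On the \p{L} suggestion: the standard re module does not accept it, but the third-party regex package does support Unicode property escapes. The following is only a sketch, assuming Python 3 and that regex may or may not be installed; when it is missing, the code falls back to the [^\W\d_] class from the earlier answer, which gives the same result for this particular sample string:

```python
import re

try:
    # Third-party package (pip install regex); not part of the standard library.
    import regex

    LETTER_RUN = regex.compile(r"\p{L}+")
except ImportError:
    # Fallback: the standard-library approximation.
    LETTER_RUN = re.compile(r"[^\W\d_]+")

# "riché_9 naïve" written with escapes so the code points are unambiguous.
sample = "rich\u00e9_9 na\u00efve"

found = LETTER_RUN.findall(sample)
print(found)  # ['riché', 'naïve']
```

The two patterns are not identical in general: as noted earlier, [^\W\d_] also accepts some non-decimal numeric characters such as U+3021, which \p{L} would not.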