How can I match an alpha character with a regular expression? I want a character that is in <code>\w</code> but is not in <code>\d</code>. I want it to be Unicode-compatible, which is why I cannot use <code>[a-zA-Z]</code>.

Your first two sentences contradict each other. "in <code>\w</code> but is not in <code>\d</code>" includes underscore. I'm assuming from your third sentence that you don't want underscore.

Using a Venn diagram on the back of an envelope helps. Let's look at what we DON'T want:

(1) characters that are not matched by <code>\w</code> (i.e. we don't want anything that's not alpha, digits, or underscore) => <code>\W</code>
(2) digits => <code>\d</code>
(3) underscore => <code>_</code>

So what we don't want is anything in the character class <code>[\W\d_]</code>, and consequently what we do want is anything in the character class <code>[^\W\d_]</code>.

Here's a simple example (Python 2.6).

<pre class="prettyprint"><code>>>> import re
>>> rx = re.compile("[^\W\d_]+", re.UNICODE)
>>> rx.findall(u"abc_def,k9")
[u'abc', u'def', u'k']
</code></pre>

Further exploration reveals a few quirks of this approach:

<pre class="prettyprint"><code>>>> import unicodedata as ucd
>>> allsorts = u"\u0473\u0660\u06c9\u24e8\u4e0a\u3020\u3021"
>>> for x in allsorts:
...     print repr(x), ucd.category(x), ucd.name(x)
...
u'\u0473' Ll CYRILLIC SMALL LETTER FITA
u'\u0660' Nd ARABIC-INDIC DIGIT ZERO
u'\u06c9' Lo ARABIC LETTER KIRGHIZ YU
u'\u24e8' So CIRCLED LATIN SMALL LETTER Y
u'\u4e0a' Lo CJK UNIFIED IDEOGRAPH-4E0A
u'\u3020' So POSTAL MARK FACE
u'\u3021' Nl HANGZHOU NUMERAL ONE
>>> rx.findall(allsorts)
[u'\u0473', u'\u06c9', u'\u4e0a', u'\u3021']
</code></pre>

U+3021 (HANGZHOU NUMERAL ONE) is treated as numeric (hence it matches <code>\w</code>), but it appears that Python interprets "digit" to mean "decimal digit" (category Nd), so it doesn't match <code>\d</code> and therefore survives the negated class.

U+24E8 (CIRCLED LATIN SMALL LETTER Y) doesn't match <code>\w</code>, so it is excluded even though it looks like a letter.

All CJK ideographs are classed as "letters" and thus match <code>\w</code>.

Whether or not any of the above three points is a concern, this approach is the best you will get out of the re module as currently released. Syntax like <code>\p{letter}</code> is in the future.
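For readers on Python 3, the same negated class can be sketched as below. This is an illustration and not part of the original answer: it assumes Python 3, where <code>str</code> patterns are Unicode-aware by default (so <code>re.UNICODE</code> is not needed), and the names <code>ALPHA_RUN</code> and <code>alpha_runs</code> are made up for the example.

```python
import re
import unicodedata

# Exclude non-word characters (\W), decimal digits (\d), and underscore.
# What remains is "word characters that are neither digits nor underscore".
# A raw string keeps the backslashes intact for the regex engine.
ALPHA_RUN = re.compile(r"[^\W\d_]+")


def alpha_runs(text):
    """Return every maximal run of characters matched by ALPHA_RUN."""
    return ALPHA_RUN.findall(text)


if __name__ == "__main__":
    print(alpha_runs("abc_def,k9"))
    # Show, per character, the Unicode category and whether the class accepts it.
    for ch in "\u0473\u0660\u06c9\u24e8\u4e0a\u3020\u3021":
        matched = bool(ALPHA_RUN.fullmatch(ch))
        print(f"U+{ord(ch):04X} {unicodedata.category(ch)} matched={matched}")
```

The per-character loop is only there to make the quirks visible; the pattern itself is the one-liner at the top.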

What about:

<pre class="prettyprint"><code>\p{L}
</code></pre>

You can use this document as a reference: Unicode Regular Expressions

EDIT: It seems Python's re module doesn't handle Unicode expressions such as <code>\p{L}</code>. Take a look at this link: Handling Accented Characters with Python Regular Expressions -- [A-Z] just isn't good enough (no longer active, link to internet archive)

Other references:

<ul>
<li>re.UNICODE</li>
<li>python and regular expression with unicode</li>
<li>Unicode Technical Standard #18: Unicode Regular Expressions</li>
</ul>

<hr>

For posterity, here are the examples from the blog:

<pre class="prettyprint"><code>import re
string = 'riché'
print string
riché

richre = re.compile('([A-z]+)')
match = richre.match(string)
print match.groups()
('rich',)

richre = re.compile('(\w+)',re.LOCALE)
match = richre.match(string)
print match.groups()
('rich',)

richre = re.compile('([é\w]+)')
match = richre.match(string)
print match.groups()
('rich\xe9',)

richre = re.compile('([\xe9\w]+)')
match = richre.match(string)
print match.groups()
('rich\xe9',)

richre = re.compile('([\xe9-\xf8\w]+)')
match = richre.match(string)
print match.groups()
('rich\xe9',)

string = 'richéñ'
match = richre.match(string)
print match.groups()
('rich\xe9\xf1',)

richre = re.compile('([\u00E9-\u00F8\w]+)')
print match.groups()
('rich\xe9\xf1',)

matched = match.group(1)
print matched
richéñ
</code></pre>
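Because the re module does not accept <code>\p{L}</code>, one way to approximate it without a regex is to test each character's Unicode general category directly. The following is a minimal sketch under stated assumptions, not something from the answer above: it assumes Python 3, and <code>is_letter</code> and <code>letter_runs</code> are hypothetical helper names. It treats a character as a letter when its category starts with "L" (Lu, Ll, Lt, Lm, Lo), which is the set <code>\p{L}</code> denotes in UTS #18.

```python
import unicodedata
from itertools import groupby


def is_letter(ch):
    """True if ch has a Unicode general category in the L group."""
    return unicodedata.category(ch).startswith("L")


def letter_runs(text):
    """Split text into maximal runs of letters, dropping everything else."""
    # groupby yields consecutive characters sharing the same is_letter result;
    # only the groups whose key is True are kept.
    return ["".join(group) for letter, group in groupby(text, key=is_letter) if letter]


if __name__ == "__main__":
    print(letter_runs("abc_def,k9"))
    print(letter_runs("rich\u00e9\u00f1 42"))
```

Unlike <code>[^\W\d_]</code>, this check rejects U+3021 (category Nl), because only the L categories count as letters here.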

python-re: How do I match an alpha character

Tags: python, regex, regex-negation, unicode

basaundi

2 Answers

John Machin

Rubens Farias
