Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

python-re: How do I match an alpha character

How can I match an alpha character with a regular expression. I want a character that is in \w but is not in \d. I want it unicode compatible that's why I cannot use [a-zA-Z].

like image 526
basaundi Avatar asked Jan 10 '10 23:01

basaundi


People also ask

How do you match a character in Python?

Using m option allows it to match newline as well. Matches any single character in brackets. Matches 0 or more occurrences of preceding expression. Matches 1 or more occurrence of preceding expression.

What does match () do in Python?

match() function of re in Python will search the regular expression pattern and return the first occurrence. The Python RegEx Match method checks for a match only at the beginning of the string. So, if a match is found in the first line, it returns the match object.

What does * do in regex?

The Match-zero-or-more Operator ( * ) This operator repeats the smallest possible preceding regular expression as many times as necessary (including zero) to match the pattern. `*' represents this operator. For example, `o*' matches any string made up of zero or more `o' s.

How do you write alphanumeric in regex?

The regex \w is equivalent to [A-Za-z0-9_] , matches alphanumeric characters and underscore.


2 Answers

Your first two sentences contradict each other. "in \w but is not in \d" includes underscore. I'm assuming from your third sentence that you don't want underscore.

Using a Venn diagram on the back of an envelope helps. Let's look at what we DON'T want:

(1) characters that are not matched by \w (i.e. don't want anything that's not alpha, digits, or underscore) => \W
(2) digits => \d
(3) underscore => _

So what we don't want is anything in the character class [\W\d_] and consequently what we do want is anything in the character class [^\W\d_]

Here's a simple example (Python 2.6).

>>> import re >>> rx = re.compile("[^\W\d_]+", re.UNICODE) >>> rx.findall(u"abc_def,k9") [u'abc', u'def', u'k'] 

Further exploration reveals a few quirks of this approach:

>>> import unicodedata as ucd >>> allsorts =u"\u0473\u0660\u06c9\u24e8\u4e0a\u3020\u3021" >>> for x in allsorts: ...     print repr(x), ucd.category(x), ucd.name(x) ... u'\u0473' Ll CYRILLIC SMALL LETTER FITA u'\u0660' Nd ARABIC-INDIC DIGIT ZERO u'\u06c9' Lo ARABIC LETTER KIRGHIZ YU u'\u24e8' So CIRCLED LATIN SMALL LETTER Y u'\u4e0a' Lo CJK UNIFIED IDEOGRAPH-4E0A u'\u3020' So POSTAL MARK FACE u'\u3021' Nl HANGZHOU NUMERAL ONE >>> rx.findall(allsorts) [u'\u0473', u'\u06c9', u'\u4e0a', u'\u3021'] 

U+3021 (HANGZHOU NUMERAL ONE) is treated as numeric (hence it matches \w) but it appears that Python interprets "digit" to mean "decimal digit" (category Nd) so it doesn't match \d

U+2438 (CIRCLED LATIN SMALL LETTER Y) doesn't match \w

All CJK ideographs are classed as "letters" and thus match \w

Whether any of the above 3 points are a concern or not, that approach is the best you will get out of the re module as currently released. Syntax like \p{letter} is in the future.

like image 186
John Machin Avatar answered Oct 05 '22 02:10

John Machin


What about:

\p{L} 

You can to use this document as reference: Unicode Regular Expressions

EDIT: Seems Python doesn't handle Unicode expressions. Take a look into this link: Handling Accented Characters with Python Regular Expressions -- [A-Z] just isn't good enough (no longer active, link to internet archive)

Another references:

  • re.UNICODE
  • python and regular expression with unicode
  • Unicode Technical Standard #18: Unicode Regular Expressions

For posterity, here are the examples on the blog:

import re string = 'riché' print string riché  richre = re.compile('([A-z]+)') match = richre.match(string) print match.groups() ('rich',)  richre = re.compile('(\w+)',re.LOCALE) match = richre.match(string) print match.groups() ('rich',)  richre = re.compile('([é\w]+)') match = richre.match(string) print match.groups() ('rich\xe9',)  richre = re.compile('([\xe9\w]+)') match = richre.match(string) print match.groups() ('rich\xe9',)  richre = re.compile('([\xe9-\xf8\w]+)') match = richre.match(string) print match.groups() ('rich\xe9',)  string = 'richéñ' match = richre.match(string) print match.groups() ('rich\xe9\xf1',)  richre = re.compile('([\u00E9-\u00F8\w]+)') print match.groups() ('rich\xe9\xf1',)  matched = match.group(1) print matched richéñ 
like image 40
Rubens Farias Avatar answered Oct 05 '22 02:10

Rubens Farias