
Python 3 regex with diacritics and ligatures

Names of the form "Caesar, Julius" are to be split into first name "Julius" and surname "Caesar".

Names may contain diacritics (á, à, é, ...) and ligatures (æ, ø).

This code seems to work OK in Python 3.3:

import re

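def doesmatch(pat, text):
    # 'text' rather than 'str', so the built-in str type is not shadowed
    try:
        yup = re.search(pat, text)
        print('Firstname {0} lastname {1}'.format(yup.group(2), yup.group(1)))
    except AttributeError:
        # re.search returned None, so .group() raised AttributeError
        print('no match for {0}'.format(text))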

s = 'Révèrberë, Harry'
t = 'Åapö, Renée'
u = 'C3po, Robby'
v = 'Mærsk, Efraïm'
w = 'MacDønald, Ron'
x = 'Sträßle, Mpopo'

pat = r'^([^\d\s]+), ([^\d\s]+)'
# each () matches a run of characters that are neither digits nor whitespace:
# letters with diacritics and ligatures pass, but so does punctuation

for i in s, t, u, v, w, x:
    doesmatch(pat, i)

All except u match (names containing digits are rejected), but I wonder whether there is a better way than the non-digit, non-space approach. More importantly, I'd like to refine the pattern so that it distinguishes capitals from lowercase letters, including capital letters with diacritics and capital ligatures, preferably still using a regex: as if ([A-Z][a-z]+) would also match accented and combined characters.

Is this possible?

(What I've looked at so far: Dive Into Python 3 on UTF-8 vs. Unicode, and this regex tutorial on Unicode (which I'm not using). I think I don't need the new regex module, but I admit I haven't read all of its documentation.)

RolfBly asked Apr 10 '13 21:04


1 Answer

If you want to distinguish uppercase and lowercase letters using the standard library's re module, then I'm afraid you'll have to build a character class of all the relevant Unicode codepoints manually.
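One way to avoid typing such a class by hand is to generate it from the Unicode database. The following is only a sketch under a few assumptions: the helper names (chars_in_category, split_name) are made up for illustration, "uppercase" and "lowercase" are taken to mean the Unicode general categories Lu and Ll, and input is normalized to NFC so that accented letters are single code points.

```python
import re
import sys
import unicodedata


def chars_in_category(category):
    """Return every code point whose Unicode general category equals `category`."""
    return ''.join(
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == category
    )


# Letters are never regex metacharacters, so they can go into [...] unescaped.
UPPER = chars_in_category('Lu')   # uppercase letters, e.g. A, Å, Æ
LOWER = chars_in_category('Ll')   # lowercase letters, e.g. a, é, ß, ø

# One uppercase letter followed by one or more lowercase letters, twice.
NAME_PAT = re.compile(r'^([{0}][{1}]+), ([{0}][{1}]+)$'.format(UPPER, LOWER))


def split_name(name):
    """Return (first_name, surname), or None if the name does not fit the pattern."""
    # NFC composes 'e' + combining accent into a single letter where possible.
    m = NAME_PAT.match(unicodedata.normalize('NFC', name))
    if m is None:
        return None
    return m.group(2), m.group(1)


for name in ('Révèrberë, Harry', 'C3po, Robby', 'MacDønald, Ron'):
    print(name, '->', split_name(name))
```

Note that a strict "one capital, then lowercase" pattern like this one rejects names with an inner capital such as MacDønald, so the pattern would need loosening if those should pass.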

If you don't really need to do this, use

[^\W\d_]

to match any Unicode letter. This character class matches anything that's "not a non-alphanumeric character" (which is the same as "an alphanumeric character") and that is neither a digit nor an underscore.
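As a rough illustration, that class can be dropped into the pattern from the question in place of [^\d\s]. The names split_letters_only and is_capitalized below are hypothetical helpers, and the capitalization check is done with plain string methods after the match rather than inside the regex; treat it as a sketch, not the only way to do it.

```python
import re

LETTERS = r'[^\W\d_]+'   # one or more word characters, excluding digits and underscore
NAME_PAT = re.compile(r'^({0}), ({0})$'.format(LETTERS))


def split_letters_only(name):
    """Return (first_name, surname) if both parts consist of letters only, else None."""
    m = NAME_PAT.match(name)
    return (m.group(2), m.group(1)) if m else None


def is_capitalized(word):
    """True if the first letter is uppercase and all remaining letters are lowercase."""
    return word[:1].isupper() and word[1:].islower()


for name in ('Mærsk, Efraïm', 'C3po, Robby', 'Sm!th, Bob', 'MacDønald, Ron'):
    parts = split_letters_only(name)
    if parts is None:
        print('no match for {0}'.format(name))
    else:
        first, last = parts
        print('Firstname {0} lastname {1} (capitalized: {2})'.format(
            first, last, is_capitalized(first) and is_capitalized(last)))
```

Unlike the [^\d\s] pattern, this version also rejects punctuation inside a name, since characters such as "!" are not word characters.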

Tim Pietzcker answered Nov 04 '22 06:11