
Matching names of the form Firstname Lastname with international characters

Tags:

python

regex

I'm trying to catch first names by assuming that names have the form Firstname Lastname. This works well with the code below, but I would also like to catch international names like Pär Åberg. I found some solutions, but unfortunately they do not seem to work with Python's regex flavor. Does anyone have any insight into this?

#!/usr/bin/python
# -*- coding: utf-8 -*- 
import re

text = """
This is a text containing names of people in the text such as 
Hillary Clinton or Barack Obama. My problem is with names that uses stuff 
outside A-Z like Swedish names such as Pär Åberg."""

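# [A-Z] only covers ASCII capitals, so "Pär Åberg" is not matched.
for name in re.findall(r"(([A-Z])[\w-]*(\s+[A-Z][\w-]*)+)", text):
    # name[0] is the whole match; its first word is the first name
    firstname = name[0].split()[0]
    print(firstname)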
cowboyvspirate asked Oct 31 '22 15:10



1 Answer

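Python's built-in re module does not support Unicode property classes, so you need the alternative regex library, where you can use \p{L} (any Unicode letter) and \p{Lu} (any uppercase Unicode letter).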

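Then use the following pattern with the regex module (the ur prefix is Python 2 syntax; in Python 3, write it as a plain r'...' string):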

ur'\p{Lu}[\w-]*(?:\s+\p{Lu}[\w-]*)+'

When the pattern passed to regex is a Unicode string, the UNICODE flag is applied automatically:

If neither the ASCII, LOCALE nor UNICODE flag is specified, it will default to UNICODE if the regex pattern is a Unicode string and ASCII if it’s a bytestring.
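
If installing a third-party module is not an option, something similar can be sketched with the standard library alone. This is not the \p{Lu} pattern above but a swapped-in technique, assuming Python 3: the built-in re has no \p{Lu}, so the regex only splits the text into words (for str patterns, \w is Unicode-aware by default there), and str.isupper() does the uppercase check. The names WORD, find_names and sample are made up for illustration.

```python
import re

# A word: starts with a Unicode letter ([^\W\d_] means "a word character that
# is neither a digit nor an underscore"), then word characters or hyphens.
WORD = re.compile(r"[^\W\d_][\w-]*")


def find_names(text):
    """Return runs of two or more capitalized words separated only by whitespace."""
    names = []
    run = []         # capitalized words collected for the current candidate name
    prev_end = None  # end offset of the previously seen word
    for match in WORD.finditer(text):
        word = match.group()
        capitalized = word[0].isupper()  # Unicode-aware: "Å".isupper() is True
        # The run continues only if the previous word belongs to it and
        # nothing but whitespace lies between the two words.
        if run and capitalized and text[prev_end:match.start()].isspace():
            run.append(word)
        else:
            if len(run) > 1:
                names.append(" ".join(run))
            run = [word] if capitalized else []
        prev_end = match.end()
    if len(run) > 1:
        names.append(" ".join(run))
    return names


sample = """
This is a text containing names of people in the text such as
Hillary Clinton or Barack Obama. My problem is with names that uses stuff
outside A-Z like Swedish names such as Pär Åberg."""

names = find_names(sample)
first_names = [name.split()[0] for name in names]
print(first_names)  # ['Hillary', 'Barack', 'Pär']
```

Like the original pattern, this treats any run of capitalized words as a name, so a capitalized sentence opener directly followed by a name would be swept into the match.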

Wiktor Stribiżew answered Nov 15 '22 04:11
