Matching names of the form Firstname Lastname with international characters

Question

I'm trying to catch first names by assuming that they appear in the form Firstname Lastname. This works well with the code below, but I would also like to catch international names like Pär Åberg. I found some solutions, but unfortunately they do not seem to work with Python-flavoured regexps. Does anyone have any insight into this?

#!/usr/bin/python
# -*- coding: utf-8 -*- 
import re

text = """
This is a text containing names of people in the text such as 
Hillary Clinton or Barack Obama. My problem is with names that uses stuff 
outside A-Z like Swedish names such as Pär Åberg."""

# [A-Z] only matches ASCII capitals, so "Pär Åberg" is not found.
for name in re.findall(r"(([A-Z])[\w-]*(\s+[A-Z][\w-]*)+)", text):
    firstname = name[0].split()[0]
    print(firstname)

Wiktor Stribiżew · Accepted Answer

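Python's built-in re module does not support Unicode property classes such as \p{L}, so you need the alternative regex library (the regex module on PyPI). There you can use \p{L} for any Unicode letter and \p{Lu} for any uppercase Unicode letter.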

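Then, use the following pattern (the ur prefix is Python 2 syntax; in Python 3, write r'...' instead, since every str is already Unicode):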

ur'\p{Lu}[\w-]*(?:\s+\p{Lu}[\w-]*)+'

When using a Unicode string to initialize regex, the UNICODE flag is used automatically:

If neither the ASCII, LOCALE nor UNICODE flag is specified, it will default to UNICODE if the regex pattern is a Unicode string and ASCII if it’s a bytestring.
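
To see the pattern at work, here is a minimal Python 3 sketch. It assumes the third-party regex module is installed (pip install regex). The names TEXT, name_pattern and first_names are made up for illustration. The fallback branch is not part of the answer: it is only a rough way to emulate \p{Lu} with the standard re module, by listing every character in the Unicode category Lu, for the case where regex is not available.

```python
import re
import sys
import unicodedata

try:
    import regex  # third-party module: pip install regex
except ImportError:
    regex = None

TEXT = "Hillary Clinton or Barack Obama, and Swedish names such as Pär Åberg."


def name_pattern():
    """Compile the Firstname Lastname pattern with whichever engine is available."""
    if regex is not None:
        # \p{Lu} matches any uppercase Unicode letter (regex module only).
        return regex.compile(r'\p{Lu}[\w-]*(?:\s+\p{Lu}[\w-]*)+')
    # Fallback (an assumption, not from the answer): spell out category Lu
    # as an explicit character class that the standard re module understands.
    upper = ''.join(
        chr(cp) for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == 'Lu'
    )
    lu = '[' + re.escape(upper) + ']'
    return re.compile(lu + r'[\w-]*(?:\s+' + lu + r'[\w-]*)+')


PATTERN = name_pattern()  # compiled once; the fallback class is slow to build


def first_names(text):
    """Return the first word of every 'Firstname Lastname ...' match."""
    return [m.group(0).split()[0] for m in PATTERN.finditer(text)]


print(first_names(TEXT))  # ['Hillary', 'Barack', 'Pär']
```

Using a non-capturing group (?:...) means each match is the whole name, so there is no need to index into a tuple as in the question's code.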

Tags:

python

regex

cowboyvspirate
