I'm trying to catch first names by assuming that names have the form Firstname Lastname. This works well with the code below, but I would also like to catch international names like Pär Åberg. I found some solutions, but unfortunately they do not seem to work with Python-flavoured regexp. Does anyone have insights into this?
#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
text = """
This is a text containing names of people in the text such as
Hillary Clinton or Barack Obama. My problem is with names that uses stuff
outside A-Z like Swedish names such as Pär Åberg."""
for name in re.findall(r"(([A-Z])[\w-]*(\s+[A-Z][\w-]*)+)", text):
    # findall returns one tuple per match; element 0 is the whole name
    firstname = name[0].split()[0]
    print(firstname)
You need the alternative regex library (the third-party regex module), because it supports \p{L}, which matches any Unicode letter.
Then, use
ur'\p{Lu}[\w-]*(?:\s+\p{Lu}[\w-]*)+'
When using a Unicode string to initialize regex, the UNICODE flag is used automatically:
If neither the ASCII, LOCALE nor UNICODE flag is specified, it will default to UNICODE if the regex pattern is a Unicode string and ASCII if it’s a bytestring.