Splitting names that include "de", "da", etc. into first, middle, last, etc

Question

I want to split Brazilian names into parts. However, in names like the ones below, "de", "da" (and others) are not separate parts: they always go with the following word, so a normal split doesn't work.

test1 = "Francisco da Sousa Rodrigues" #special split
test2 = "Emiliano Rodrigo Carrasco" #normal split
test3 = "Alberto de Francia" #special split
test4 = "Bruno Rezende" #normal split

My expected output would be:

[Francisco, da Sousa, Rodrigues] #1
[Emiliano, Rodrigo, Carrasco] #2
[Alberto, de Francia] #3
[Bruno, Rezende] #4

For the special cases I tried this pattern:

PATTERN = re.compile(r"\s(?=[da, de, do, dos, das])")
re.split(PATTERN, test1) (...)

but the output is not what I expected:

['Francisco', 'da Sousa Rodrigues'] #1
['Alberto', 'de Francia'] #3

Any idea how to fix it? Is there a way to just use one pattern for both "normal" and "special" case?

L3viathan · Accepted Answer

Will the names always be written in the "canonical" way, i.e. with every part capitalised except for da, de, do, ...?

In that case, you can use that fact:

>>> import re
>>> for t in (test1, test2, test3, test4):
...     print(re.findall(r"(?:[a-z]+ )?[A-Z]\w+", t, re.UNICODE))
['Francisco', 'da Sousa', 'Rodrigues']
['Emiliano', 'Rodrigo', 'Carrasco']
['Alberto', 'de Francia']
['Bruno', 'Rezende']
>>>

The "right" way to do what you want (apart from not doing it at all) would be a negative lookbehind: split on a space that isn't preceded by any of da, de, do, ... Sadly, this is (AFAIK) impossible in general, because re requires lookbehinds to be of fixed width. If no names end in those syllables, which you really can't assume, you could make every alternative the same width and do this:

PATTERN = re.compile(r"(?<! da| de| do|dos|das)\s")
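As a quick check of the fixed-width pattern, here is a minimal sketch that uses only the standard re module. "Maria dos Santos" and "Midas Silva" are hypothetical inputs added for illustration, not names from the question; the second one shows the caveat about names that happen to end in one of the listed syllables.

```python
import re

# Every alternative is exactly three characters wide, which is what lets
# the standard re module accept this lookbehind.
PATTERN = re.compile(r"(?<! da| de| do|dos|das)\s")

names = [
    "Francisco da Sousa Rodrigues",
    "Alberto de Francia",
    "Maria dos Santos",  # hypothetical extra input
    "Midas Silva",       # hypothetical: ends in "das", so it is not split
]

results = [PATTERN.split(name) for name in names]
for parts in results:
    print(parts)
```

The last input comes back as a single part, which is why this pattern is only safe if you know no name ends in "dos" or "das".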

You may occasionally stumble upon cases that don't work: if the first letter of a name is an accented character (or if the particle, hypothetically, contained one), [A-Z] and [a-z] won't match it and the findall above will split the name incorrectly. To fix this, you won't get around using an external library: regex.

Your new findall will look like this then:

regex.findall(r"(?:\p{Ll}+ )?\p{Lu}\w+", "Luiz Ângelo de Urzêda")

The \p{Ll} refers to any lowercase letter, and \p{Lu} to any uppercase letter.
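If installing regex is not an option, a plain-Python sketch can get similar Unicode awareness from str.islower(), which does handle accented letters. This is a different technique from the findall above, and split_name is a hypothetical helper name: it glues every token that starts with a lowercase letter onto the next capitalised token, so it still assumes names are written in the "canonical" way.

```python
def split_name(name):
    """Hypothetical helper: group lowercase particles with the next word."""
    parts = []
    pending = []  # lowercase tokens waiting for the next capitalised token
    for token in name.split():
        if token[0].islower():
            pending.append(token)
        else:
            parts.append(" ".join(pending + [token]))
            pending = []
    if pending:
        # Trailing lowercase tokens with nothing to attach to.
        parts.append(" ".join(pending))
    return parts


print(split_name("Luiz Ângelo de Urzêda"))
```

Unlike the regex pattern, which allows at most one lowercase word before each capitalised one, this sketch attaches any run of lowercase tokens.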

RomanPerekhrest · Answer

You can use the regex.split() function from Python's third-party regex library, which supports variable-width lookbehinds:

installation:

pip install regex

usage:

import regex as re

test_names = ["Francisco da Sousa Rodrigues", "Emiliano Rodrigo Carrasco",
              "Alberto de Francia", "Bruno Rezende"]

for n in test_names:
    print(re.split(r'(?<!das?|de|dos?)\s+', n))

The output:

['Francisco', 'da Sousa', 'Rodrigues']
['Emiliano', 'Rodrigo', 'Carrasco']
['Alberto', 'de Francia']
['Bruno', 'Rezende']

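(?<!das?|de|dos?)\s+ - the negative lookbehind assertion (?<!...) ensures that the whitespace \s+ is not preceded by one of the special cases da|das|de|do|dos. Note that the lookbehind has no word boundary, so a name that merely ends in one of these (e.g. "Amanda" or "Eduardo") would also stay attached to the following word.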

https://pypi.python.org/pypi/regex/

anubhava · Answer

You may use this regex in findall with an optional group:

(?:(?:da|de|do|dos|das)\s+)?\S+

Here we make (?:da|de|do|dos|das) and 1+ whitespace following this optional.

Code Example:

import re

test1 = "Francisco da Sousa Rodrigues" #special split
test2 = "Emiliano Rodrigo Carrasco" #normal split
test3 = "Alberto de Francia" #special split
test4 = "Bruno Rezende" #normal split

PATTERN = re.compile(r'(?:(?:da|de|do|dos|das)\s+)?\S+')

>>> print(re.findall(PATTERN, test1))
['Francisco', 'da Sousa', 'Rodrigues']

>>> print(re.findall(PATTERN, test2))
['Emiliano', 'Rodrigo', 'Carrasco']

>>> print(re.findall(PATTERN, test3))
['Alberto', 'de Francia']

>>> print(re.findall(PATTERN, test4))
['Bruno', 'Rezende']
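
To get from the list of parts to the first/middle/last fields mentioned in the title, one possible sketch is below. label_parts is a hypothetical helper, and treating the first part as the first name, the final part as the last name, and everything in between as middle names is an assumption that may not match real naming conventions.

```python
import re

PATTERN = re.compile(r'(?:(?:da|de|do|dos|das)\s+)?\S+')


def label_parts(name):
    """Hypothetical helper: map a full name to first/middle/last fields."""
    parts = PATTERN.findall(name)
    if not parts:
        return {"first": "", "middle": [], "last": ""}
    return {
        "first": parts[0],
        "middle": parts[1:-1],
        # A single-part name has no separate last name.
        "last": parts[-1] if len(parts) > 1 else "",
    }


print(label_parts("Francisco da Sousa Rodrigues"))
```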

Tags: python, regex, python-3.x

pawelty
