
Splitting names that include "de", "da", etc. into first, middle, last, etc

I want to split Brazilian names into parts. However, in names like the ones below, "de", "da" (and others) are not separate parts: they always go with the following word, so a normal split doesn't work.

test1 = "Francisco da Sousa Rodrigues" #special split
test2 = "Emiliano Rodrigo Carrasco" #normal split
test3 = "Alberto de Francia" #special split
test4 = "Bruno Rezende" #normal split

My expected output would be:

[Francisco, da Sousa, Rodrigues] #1
[Emiliano, Rodrigo, Carrasco] #2
[Alberto, de Francia] #3
[Bruno, Rezende] #4

For the special cases I tried this pattern:

PATTERN = re.compile(r"\s(?=[da, de, do, dos, das])")
re.split(PATTERN, test1) (...)

but the output is not what I expected:

['Francisco', 'da Sousa Rodrigues'] #1
['Alberto', 'de Francia'] #3

Any idea how to fix it? Is there a way to use just one pattern for both the "normal" and the "special" cases?

pawelty asked Jan 22 '18 13:01


3 Answers

Will the names always be written in the "canonical" way, i.e. with every part capitalised except for da, de, do, ...?

In that case, you can use that fact:

>>> import re
>>> for t in (test1, test2, test3, test4):
...     print(re.findall(r"(?:[a-z]+ )?[A-Z]\w+", t, re.UNICODE))
...
['Francisco', 'da Sousa', 'Rodrigues']
['Emiliano', 'Rodrigo', 'Carrasco']
['Alberto', 'de Francia']
['Bruno', 'Rezende']
>>>

The "right" way to do what you want (apart from not doing it at all) would be a negative lookbehind: split on a space that isn't preceded by any of da, de, do, ... Sadly, this is (AFAIK) not possible in general, because re requires lookbehinds to be of fixed width. If you could assume that no names end in those syllables, which you really can't, you could do this:

PATTERN = re.compile(r"(?<! da| de| do|dos|das)\s")
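
One way to reduce false matches on names that merely end in those letters is to chain two lookbehinds, each of which is fixed-width on its own. The following is only a sketch: it assumes the particle is always preceded by a space (so it never starts the string) and that da, de, do, dos, das is the complete list. The names SPLIT_RE and split_name are made up for illustration.

```python
import re

# Two chained negative lookbehinds: the first covers the 3-character
# alternatives (" da", " de", " do"), the second the 4-character ones
# (" dos", " das"). Each lookbehind is fixed-width on its own, which is
# what the standard re module requires.
SPLIT_RE = re.compile(r"(?<! da| de| do)(?<! dos| das)\s")


def split_name(name):
    """Split on whitespace unless it directly follows a particle."""
    return SPLIT_RE.split(name)


print(split_name("Francisco da Sousa Rodrigues"))  # ['Francisco', 'da Sousa', 'Rodrigues']
print(split_name("Maria das Dores"))               # ['Maria', 'das Dores']
print(split_name("Midas Silva"))                   # ['Midas', 'Silva']
```

Because the leading space is part of every alternative, a first name such as "Midas" no longer blocks the split, which the single three-character lookbehind would do.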

The findall approach may occasionally hit cases that don't work: if the first letter of a name is an accented character (or if the article, hypothetically, contained one), it will match incorrectly. To fix this, you will need an external library: regex.

Your new findall will then look like this:

import regex
regex.findall(r"(?:\p{Ll}+ )?\p{Lu}\w+", "Luiz Ângelo de Urzêda")

The \p{Ll} refers to any lowercase letter, and \p{Lu} to any uppercase letter.
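
If installing regex is not an option, a plain-Python alternative (a different technique, not a regular expression) is to split on whitespace and glue every all-lowercase token onto the capitalised token that follows it. str.islower() is Unicode-aware, so accented letters are handled. This is a rough sketch that still assumes the canonical capitalisation; group_particles is a hypothetical helper name.

```python
def group_particles(name):
    """Group all-lowercase tokens with the next capitalised token."""
    parts = []
    pending = []  # lowercase particles waiting for the word they belong to
    for token in name.split():
        if token.islower():
            pending.append(token)
        else:
            parts.append(" ".join(pending + [token]))
            pending = []
    if pending:
        # Trailing particles with nothing after them: keep them as one part.
        parts.append(" ".join(pending))
    return parts


print(group_particles("Luiz Ângelo de Urzêda"))         # ['Luiz', 'Ângelo', 'de Urzêda']
print(group_particles("Francisco da Sousa Rodrigues"))  # ['Francisco', 'da Sousa', 'Rodrigues']
```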

L3viathan answered Oct 15 '22 01:10


Use the split() function from Python's regex library, which supports variable-width lookbehinds:

installation:

pip install regex

usage:

import regex as re

test_names = ["Francisco da Sousa Rodrigues", "Emiliano Rodrigo Carrasco",
              "Alberto de Francia", "Bruno Rezende"]

for n in test_names:
    print(re.split(r'(?<!das?|de|dos?)\s+', n))

The output:

['Francisco', 'da Sousa', 'Rodrigues']
['Emiliano', 'Rodrigo', 'Carrasco']
['Alberto', 'de Francia']
['Bruno', 'Rezende']

  • (?<!das?|de|dos?)\s+ - the negative lookbehind assertion (?<!...) ensures that the whitespace \s+ is not preceded by one of the special cases da|das|de|do|dos. Note that a name which itself ends in one of these (e.g. "Eduardo") also suppresses the split.

https://pypi.python.org/pypi/regex/

RomanPerekhrest answered Oct 15 '22 02:10


You may use this regex in findall with an optional group:

(?:(?:da|de|do|dos|das)\s+)?\S+

Here the group (?:da|de|do|dos|das), together with the one or more whitespace characters that follow it, is optional.

Code Example:

test1 = "Francisco da Sousa Rodrigues" #special split
test2 = "Emiliano Rodrigo Carrasco" #normal split
test3 = "Alberto de Francia" #special split
test4 = "Bruno Rezende" #normal split

import re

PATTERN = re.compile(r'(?:(?:da|de|do|dos|das)\s+)?\S+')

>>> print(re.findall(PATTERN, test1))
['Francisco', 'da Sousa', 'Rodrigues']

>>> print(re.findall(PATTERN, test2))
['Emiliano', 'Rodrigo', 'Carrasco']

>>> print(re.findall(PATTERN, test3))
['Alberto', 'de Francia']

>>> print(re.findall(PATTERN, test4))
['Bruno', 'Rezende']
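
If the list of particles may grow, the same optional-group pattern can be built from a list. This is a minimal sketch: PARTICLES and build_pattern are illustrative names, and the list itself is an assumption. re.IGNORECASE is not used, so a capitalised "Da" would be treated as an ordinary word.

```python
import re

PARTICLES = ["da", "de", "do", "dos", "das"]  # assumed list, extend as needed


def build_pattern(particles):
    """Compile the optional-particle pattern from a list of particles."""
    alternation = "|".join(re.escape(p) for p in particles)
    # An optional "<particle><whitespace>" prefix, then one non-space token.
    return re.compile(r"(?:(?:" + alternation + r")\s+)?\S+")


pattern = build_pattern(PARTICLES)
for name in ("Francisco da Sousa Rodrigues", "Bruno Rezende", "Maria das Dores"):
    print(pattern.findall(name))
```

The order of the alternatives does not matter here: if "da" matches but no whitespace follows, the engine backtracks and tries the longer "das".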

anubhava answered Oct 15 '22 00:10