
capturing complex names

Tags:

regex

r

My data:

Caterina Guonçallvez braçeyro 
Francisco Ro[dr]í[gueJz luveyro
Johao de Miranda calçeteyro 
Lucas Martinz Mal-Cuzinhado, braçeyro 
Francisquo d[e] Arruda braçeyro 
Francisquo de Miranda braçeyro 

-first name, last name
-first name, last name with brackets and J (the brackets come from OCR recognition)
-first name, last name with hyphen
-first name, last name with particle
-first name, last name with particle containing brackets

Expected output

Caterina Guonçallvez
Francisco Ro[dr]í[gueJz
Johao de Miranda
Lucas Martinz Mal-Cuzinhado
Francisquo d[e] Arruda
Francisquo de Miranda
  • Names begin with an uppercase letter

  • The last part of the name is followed by a space (or comma with space) and a word beginning with a lowercase character like "braçeyro" or "calçeteyro" (people's jobs)

    data <- readLines("clipboard", encoding = "latin1")

What I tried:

^([a-zA-ZàáâäãåąčćęèéêëėįìíîïłńòóôöõøùúûüųūÿýżźñçčšžÀÁÂÄÃÅĄĆČĖĘÈÉÊËÌÍÎÏĮŁŃÒÓÔÖÕØÙÚÛÜŲŪŸÝŻŹÑßÇŒÆČŠŽ∂ð])\w+[A-Z ,.'-]\w+

giving
Antonio Guomez
Caterina Guon
Francisco Ro
Johao de
Francisquo d

Wilcar asked May 21 '16 14:05



1 Answer

The pattern (([A-Z][\w\[\]-]+|de|d\[e\])\s?)+ returns:

'Caterina Guonçallvez '
'Francisco Ro[dr]í[gueJz '
'Johao de Miranda '
'Lucas Martinz Mal-Cuzinhado'
'Francisquo d[e] Arruda '
'Francisquo de Miranda '

This assumes your locale is set correctly (a UTF-8 locale), so that accented letters such as ç and í are read and matched as word characters.

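The regex matches a sequence of tokens. Each token is either a group of word characters, brackets and hyphens that starts with an uppercase letter, or the particle "de" (or "d[e]"), and each token may be followed by a single space. Because that optional space is part of the match, you will need to strip the results to remove trailing spaces.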
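As a minimal sketch of the extract-then-strip step described above, the pattern can be applied to all six sample lines at once. This assumes the stringr package and a UTF-8 session; the names `x`, `pattern` and `name_only` are only illustrative, and `str_extract()` plus `str_trim()` is one possible way to do it, not necessarily what the answer's author had in mind.

```r
library(stringr)

# The six sample lines from the question.
x <- c(
  "Caterina Guonçallvez braçeyro ",
  "Francisco Ro[dr]í[gueJz luveyro",
  "Johao de Miranda calçeteyro ",
  "Lucas Martinz Mal-Cuzinhado, braçeyro ",
  "Francisquo d[e] Arruda braçeyro ",
  "Francisquo de Miranda braçeyro "
)

# Pattern from the answer: capitalised tokens (word characters, brackets,
# hyphens) or the particles "de" / "d[e]", each followed by an optional space.
pattern <- "(([A-Z][\\w\\[\\]-]+|de|d\\[e\\])\\s?)+"

# str_extract() returns only the full match (no capture groups);
# str_trim() then removes the trailing space that the pattern may include.
name_only <- str_trim(str_extract(x, pattern))
print(name_only)
```

With the sample data, the trimmed result should correspond to the list under "Expected output" in the question. Note that `[A-Z]` only covers unaccented capitals, so a name starting with an accented uppercase letter would need a wider class.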


edit: Proof it works in R:

> Sys.setlocale("LC_ALL", "en_US.UTF-8")
> library(stringr)
> x <- "Caterina Guonçallvez braçeyro "
> str_match(x, '(([A-Z][\\w\\[\\]-]+|de|d\\[e\\])\\s?)+')
     [,1]                    [,2]           [,3]         
[1,] "Caterina Guonçallvez " "Guonçallvez " "Guonçallvez"
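The rule stated in the question (the name ends where a space, or a comma and a space, is followed by a word starting with a lowercase letter) can also be encoded directly by deleting that final word instead of matching the name. This is a different technique from the pattern above, offered only as a sketch: it assumes stringr, assumes the job is a single word at the end of the line, and the names `lines_sample`, `job_suffix` and `name_stripped` are hypothetical.

```r
library(stringr)

# A few of the sample lines from the question.
lines_sample <- c(
  "Caterina Guonçallvez braçeyro ",
  "Lucas Martinz Mal-Cuzinhado, braçeyro ",
  "Johao de Miranda calçeteyro ",
  "Francisquo d[e] Arruda braçeyro "
)

# Assumption: the job is one final word starting with a lowercase letter.
# ",?"      optional comma after the name
# "\\s+"    the separating whitespace
# "\\p{Ll}" a lowercase letter (Unicode-aware, so it is not limited to a-z)
# "\\S*"    the rest of the job word
# "\\s*$"   optional trailing whitespace up to the end of the line
job_suffix <- ",?\\s+\\p{Ll}\\S*\\s*$"

# str_remove() deletes the first match of the pattern in each string.
name_stripped <- str_remove(lines_sample, job_suffix)
print(name_stripped)
```

Because the pattern is anchored at the end of the line, a lowercase particle such as "de" in the middle of a name is left alone. Lines whose job description spans several words would need a different rule.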
L3viathan answered Nov 10 '22 04:11
