My data:
Caterina Guonçallvez braçeyro
Francisco Ro[dr]í[gueJz luveyro
Johao de Miranda calçeteyro
Lucas Martinz Mal-Cuzinhado, braçeyro
Francisquo d[e] Arruda braçeyro
Francisquo de Miranda braçeyro
-first name last name
-first name last name with brackets and J (brackets from OCR recognition)
-first name last name with hyphen
-first name last name with particle
-first name last name with particle with brackets
Expected output
Caterina Guonçallvez
Francisco Ro[dr]í[gueJz
Johao de Miranda
Lucas Martinz Mal-Cuzinhado
Francisquo d[e] Arruda
Francisquo de Miranda
Names begin with an uppercase letter.
The last part of the name is followed by a space (or a comma and a space) and a word beginning with a lowercase character, like "braçeyro" or "calçeteyro" (people's jobs).
data <- readLines("clipboard" , encoding = "latin1")
What I tried:
^([a-zA-ZàáâäãåąčćęèéêëėįìíîïłńòóôöõøùúûüųūÿýżźñçčšžÀÁÂÄÃÅĄĆČĖĘÈÉÊËÌÍÎÏĮŁŃÒÓÔÖÕØÙÚÛÜŲŪŸÝŻŹÑßÇŒÆČŠŽ∂ð])\w+[A-Z ,.'-]\w+
giving
Antonio Guomez
Caterina Guon
Francisco Ro
Johao de
Francisquo d
The pattern (([A-Z][\w\[\]-]+|de|d\[e\])\s?)+
returns:
'Caterina Guonçallvez '
'Francisco Ro[dr]í[gueJz '
'Johao de Miranda '
'Lucas Martinz Mal-Cuzinhado'
'Francisquo d[e] Arruda '
'Francisquo de Miranda '
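This assumes your locale is set correctly (a UTF-8 locale), so that accented letters such as ç and í are treated as word characters.
The regex matches one or more groups, each followed by an optional space. A group is either an uppercase letter followed by word characters, square brackets, or hyphens, or one of the particles "de" and "d[e]". Because the optional space is part of the match, you need to strip trailing spaces from the results.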
Edit: proof that it works in R:
> Sys.setlocale("LC_ALL","en_US.UTF-8")
> library(stringr)
> x <- "Caterina Guonçallvez braçeyro "
> str_match(x, '(([A-Z][\\w\\[\\]-]+|de|d\\[e\\])\\s?)+')
     [,1]                    [,2]           [,3]
[1,] "Caterina Guonçallvez " "Guonçallvez " "Guonçallvez"
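To apply the same pattern to every line and drop the trailing space in one go, something like the following should work. This is only a sketch: it assumes stringr is installed and the session uses a UTF-8 locale, and the variable names (x, name_pattern, person_names) are made up for illustration. The sample lines are the ones from the question.

```r
library(stringr)

# Sample lines copied from the question
x <- c(
  "Caterina Guonçallvez braçeyro",
  "Francisco Ro[dr]í[gueJz luveyro",
  "Johao de Miranda calçeteyro",
  "Lucas Martinz Mal-Cuzinhado, braçeyro",
  "Francisquo d[e] Arruda braçeyro",
  "Francisquo de Miranda braçeyro"
)

# Same pattern as above, with backslashes doubled for an R string literal
name_pattern <- "(([A-Z][\\w\\[\\]-]+|de|d\\[e\\])\\s?)+"

# str_extract() keeps the first full match per element;
# str_trim() removes the trailing space left by the optional \\s
person_names <- str_trim(str_extract(x, name_pattern))
print(person_names)
```

One caveat: because "de" is accepted as a group on its own, a job title that happens to start with "de" would leave a stray "de" at the end of the extracted name, so it is worth checking the results against the real data.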