
Best strategy for splitting English-style names into first and last name

I've got a list of names and I need to split them up into first and last names. Since some names have 2-3 spaces in them, a simple split on a space won't do.

What sort of heuristics do people use to perform the split?

Note that this isn't a duplicate of questions that effectively ask how to split at a space; I'm looking for heuristics and algorithms, not actual code help.

Update: I'm limiting the problem set to English-style names. This is all I need to solve and likely all that anyone approaching this (English language) question will need as well.

David Pfeffer asked Nov 03 '12 14:11



2 Answers

I've read a very interesting and comprehensive post on this subject:

http://www.w3.org/International/questions/qa-personal-names

It even suggests asking yourself whether you really need separate fields for first and last names. The answer seems to depend on the target region(s) of your application.

fan711 answered Oct 21 '22 00:10



Two approaches can help, though neither fully solves this problem.

  1. Programmatically separate the easy names and push the ones that are not easy into a separate list, "remaining to be split". Sort that list manually. As you sort, some heuristics might emerge that could be coded, further reducing the size of the remaining list. If this is a one-time job and the list is not super massive, this will get the job done.
  2. A closely related problem is when a name is already split, but you don't know which part is the first name and which is the last. Some systems work around this with fuzzy lookups: if the first attempt finds no match, flip the first and last name and try again. You didn't say why you need to split the names. If it is to look up against reference data, consider a similar fuzzy lookup heuristic that tries different splits instead of trying to get the split correct up front.
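As a rough illustration of both ideas, here is a minimal Python sketch. Everything in it is an assumption made for the example: the function names `partition_names`, `candidate_splits` and `lookup` are hypothetical, "easy" is taken to mean exactly two space-separated tokens, and the reference data is modelled as a plain set of `(first, last)` tuples matched exactly rather than with any real fuzzy matching.

```python
def partition_names(full_names):
    """Split the easy names; defer everything else to manual review."""
    split, remaining = [], []
    for raw in full_names:
        tokens = raw.split()
        if len(tokens) == 2:
            # Assumed "easy" case: exactly two tokens, first name then last name.
            split.append((tokens[0], tokens[1]))
        else:
            # One token, or three and more: leave for the manual pass.
            remaining.append(raw)
    return split, remaining


def candidate_splits(full_name):
    """Yield each (first, last) pair made by cutting at one space, plus its flip."""
    tokens = full_name.split()
    for i in range(1, len(tokens)):
        first = " ".join(tokens[:i])
        last = " ".join(tokens[i:])
        yield first, last
        yield last, first  # flipped order, in case the name is stored last-first


def lookup(full_name, reference):
    """Return the first candidate split present in the reference set, else None."""
    for pair in candidate_splits(full_name):
        if pair in reference:
            return pair
    return None


if __name__ == "__main__":
    names = ["John Smith", "Mary Ann Van Buren", "Cher"]
    split, remaining = partition_names(names)
    print(split)
    print(remaining)

    # Hypothetical reference data, just for the demonstration.
    reference = {("Mary Ann", "Van Buren"), ("John", "Smith")}
    print(lookup("Mary Ann Van Buren", reference))
    print(lookup("Smith John", reference))
```

Any heuristic that emerges during the manual pass (known surname prefixes, for instance) would go in as another branch in `partition_names`, and the lookup could normalise case or punctuation before comparing if the reference data calls for it.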

Not really an answer, but in this case there really is no perfect answer.

SporkInventor answered Oct 21 '22 02:10
