I'm doing a website migration that involves extracting a first name and last name from a full-name field. Because these values were entered by end users, all kinds of permutations exist (although they are English and generally not too strange). Mostly I can take the first word as the first name and the last word as the last name, but there are exceptions caused by the occasional prefix or suffix. While going through the data and trying to get my head around all the likely exceptions, I realized that this is a common problem that must have been at least partially solved many times before.
Before reinventing the wheel, does anyone have any regular expressions that have worked for them or useful code? Performance is not a consideration as this is a one-time utility.
Typical values to be handled:
Jason Briggs, J.D. Smith, John Y Citizen, J Scott Myers, Bill Jackobson III, Mr. John Mills
Update: while a common problem, the typical solution seems to involve handling the majority of cases and manually cleaning the rest.
(Given how often this issue must come up, I originally expected to find a utility library for it, but I was not able to find one with Google.)
My recommendation would be the following:
Split the names on the spaces.
Check the length of the returned array. If 2, easy split. If more, next.
Compare the 1st value against a list of prefixes (e.g. Mr., Mrs., Ms., Dr.). If it matches, remove it; otherwise move to the next step.
Check the length of the 1st value. If it's just 1 character, combine the first 2 items in the array.
It's still not foolproof; however, it should address at least 80 percent of your cases.
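The steps above could be sketched roughly as follows in Python. This is only an illustration: the function name `split_name`, the `PREFIXES` set, the choice to ignore a trailing period when comparing, and the rule of taking the first and last remaining items at the end are all assumptions, not something the steps spell out.

```python
# Hypothetical prefix list; extend it to match your own data.
PREFIXES = {"mr", "mrs", "ms", "dr"}


def split_name(full_name):
    """Return a (first_name, last_name) guess for a full name."""
    # Step 1: split the name on whitespace.
    parts = full_name.split()

    # Step 2: exactly two items is the easy case.
    if len(parts) == 2:
        return parts[0], parts[1]

    # Step 3: drop a leading prefix such as "Mr." or "Dr".
    if parts and parts[0].rstrip(".").lower() in PREFIXES:
        parts = parts[1:]

    # Step 4: a one-character first item (e.g. "J" or "J.") is combined
    # with the next item. The len(parts) > 2 guard is an assumption so
    # that a last name is still left over afterwards.
    if len(parts) > 2 and len(parts[0].rstrip(".")) == 1:
        parts = [parts[0] + " " + parts[1]] + parts[2:]

    # Assumed final rule: first remaining item, last remaining item.
    if not parts:
        return "", ""
    if len(parts) == 1:
        return parts[0], ""
    return parts[0], parts[-1]
```

With the sample values from the question, "J Scott Myers" becomes ("J Scott", "Myers") and "Mr. John Mills" becomes ("John", "Mills"). Because the steps do not cover suffixes, "Bill Jackobson III" comes out as ("Bill", "III"), which is the kind of case that still needs manual cleanup.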
Hope this helps.
It's probably impossible to do (reliably).
Even if you can do that for some names, at some point you will get a Spanish person who writes down both family names. Or people (I forget which nationality) who enter "lastname firstname". Or one of many other situations...
The best you can probably do is split 2 words as first and last name, then go through the rest manually (yourself, or hire some professionals)...
The fastest thing to do is a hybrid algorithm-human approach. You don't want to spend the time putting together a system that works 99.99% of the time because the last 5-10% of optimization will kill you. Also, you don't want to just dump all of the work on a person because most of the cases (I'm guessing) are fairly straightforward.
So, rapidly build something like what JamesEggers suggested, but catch all of the cases that appear unusual or do not fit your predefined conversions. Then simply go through those cases manually (there shouldn't be too many).
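As a rough illustration of the "catch the unusual cases" part, the Python sketch below splits only the names that fit one deliberately conservative pattern and puts everything else in a review list. The function name `triage`, the `SIMPLE_WORD` pattern, and the "exactly two plain words" rule are assumptions made for the example; in practice the automatic rule would be whatever set of conversions you trust.

```python
import re

# Assumed definition of a "plain" word: letters, plus apostrophes or
# hyphens after the first letter (e.g. "O'Brien", "Smith-Jones").
SIMPLE_WORD = re.compile(r"^[A-Za-z][A-Za-z'\-]*$")


def triage(full_names):
    """Split the obvious names; collect everything else for a human."""
    auto = {}    # full name -> (first_name, last_name)
    review = []  # names that did not fit the predefined conversion
    for name in full_names:
        parts = name.split()
        if len(parts) == 2 and all(SIMPLE_WORD.match(p) for p in parts):
            auto[name] = (parts[0], parts[1])
        else:
            review.append(name)
    return auto, review


if __name__ == "__main__":
    samples = ["Jason Briggs", "J.D. Smith", "John Y Citizen",
               "J Scott Myers", "Bill Jackobson III", "Mr. John Mills"]
    auto, review = triage(samples)
    print(len(auto), "split automatically;", len(review), "left for review")
```

The review list is what you would then work through yourself or hand off to someone else.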
You could go through those cases by yourself or outsource them to other users by setting up HITs in Mechanical Turk:
http://aws.amazon.com/mturk/
(Assuming 500 cases at $0.05 each, which is a high reward, your total cost should be $25 at most.)
If this is a one shot deal then I would strongly consider paying someone else who is a specialist to do it for you.
They will be experienced in working with poorly structured data sets.
I have no affiliation with them but Melissa Data provide a service that seems tailored to this sort of thing.