
Human Name parsing

I have a bunch of human names. They are all "Western" names and I only need American conventions/abbreviations (e.g., Mr. instead of Sr. for señor). Unfortunately, the people to whom I am sending things did not input their own names so I can't ask them what they would like to be called. I know the gender of each person and their full name, but haven't really parsed things out more specifically.

Some examples:

  1. John Smith
  2. John Smith, Jr.
  3. John Smith Jr.
  4. John Smith XIV
  5. Dr. John Smith, Ph.D.

I'd like to be able to parse out parts of each name:

name = Name.new("John Smith Jr.")
name.first_name # => "John"
name.greeting   # => "Mr. Smith"

If I'm looking for "greeting" (probably not the best term), what I want here is, for 1-4, "Mr. Smith". For 5, I would like Dr. Smith but I'd settle for Mr. Smith.

A Ruby gem for this would be ideal. I was inspired to ask for something this strange by Chronic, a Ruby gem that handles time in a remarkably human way, letting me tell it "last Tuesday" and having it come up with something sensible. An algorithm that hits most of the corner cases would suffice.

I'm trying to deal with some of the issues presented in "Falsehoods Programmers Believe About Names".

Hut8 asked Jul 03 '13 18:07


2 Answers

Since you're limited to Western-style names, I think a few rules will get you most of the way there:

  1. If a comma appears, delete the leftmost one and everything after.
  2. Continue removing words from the beginning while, after converting to lowercase and removing any full stops, they belong to the set { mr mrs miss ms rev dr prof } and any more you can think of. Using a table of title "scores" (e.g. [mr=1, mrs=1, rev=2, dr=3, prof=4] -- order them however you want), record the highest-scoring title that was deleted.
  3. Continue removing words from the end while they belong to the set { jr phd } (after the same normalization as in step 2) or are Roman numerals of value roughly 50 or less (/\A[XVI]+\z/ is probably a good enough regex; anchor it so that it has to match the whole word).
  4. If one or more titles having nonzero scores were deleted in step 2, use the highest-scoring one. Otherwise, use "Mr." or "Mrs." according to the supplied gender.
  5. As the surname, use the last word.
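
The rules above can be sketched in Ruby roughly as follows. This is a minimal sketch, not a tested library: the names `greeting`, `normalize`, `TITLE_SCORES`, `TITLE_LABELS`, `SUFFIXES` and `ROMAN` are made up for illustration, the scores given to "miss" and "ms" are an assumption (the example table leaves them out), and the guard that always keeps at least one word is an addition so that a lone surname is never stripped.

```ruby
# Hypothetical constants; the score order follows the example table in rule 2.
TITLE_SCORES = { "mr" => 1, "mrs" => 1, "miss" => 1, "ms" => 1, # miss/ms: assumed scores
                 "rev" => 2, "dr" => 3, "prof" => 4 }.freeze
TITLE_LABELS = { "mr" => "Mr.", "mrs" => "Mrs.", "miss" => "Miss", "ms" => "Ms.",
                 "rev" => "Rev.", "dr" => "Dr.", "prof" => "Prof." }.freeze
SUFFIXES = %w[jr phd].freeze
ROMAN = /\A[XVI]+\z/ # anchored so that the whole word must be a numeral

# Lowercase a word and drop full stops: "Ph.D." becomes "phd".
def normalize(word)
  word.downcase.delete(".")
end

def greeting(full_name, gender)
  # Rule 1: delete the leftmost comma and everything after it.
  words = full_name.sub(/,.*/m, "").split

  # Rule 2: strip leading titles, remembering the highest-scoring one.
  best = nil
  while words.size > 1 && TITLE_SCORES.key?(normalize(words.first))
    title = normalize(words.shift)
    best = title if best.nil? || TITLE_SCORES[title] > TITLE_SCORES[best]
  end

  # Rule 3: strip trailing suffixes and (uppercase) Roman numerals.
  while words.size > 1 &&
        (SUFFIXES.include?(normalize(words.last)) || words.last.delete(".").match?(ROMAN))
    words.pop
  end

  # Rule 4: prefer a title that was found; otherwise fall back on gender.
  title = best ? TITLE_LABELS[best] : (gender == :female ? "Mrs." : "Mr.")

  # Rule 5: the last remaining word is taken as the surname.
  "#{title} #{words.last}"
end

puts greeting("John Smith XIV", :male)
puts greeting("Dr. John Smith, Ph.D.", :male)
```

Matching the Roman numeral against the original-case word, rather than the lowercased one, is a design choice here: it keeps a surname such as "Vi" from being mistaken for a numeral.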

It will never be possible to guarantee that a name like "John Baxter Smith" is parsed correctly, since not all double-barrelled surnames use hyphens. Is "Baxter Smith" the surname? Or is "Baxter" a middle name? I think it's safe to assume that middle names are relatively more common than double-barrelled-but-unhyphenated surnames, meaning it's better to default to reporting the last word as the surname. You might want to also compile a list of common double-barrelled surnames and check against this, however.
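
A lookup against such a list could look something like the sketch below. The `surname` helper and the `DOUBLE_BARRELLED` set are hypothetical, and the single entry is only a placeholder taken from the example above; a real list would have to be compiled separately. It also assumes the titles and suffixes have already been stripped, and that a two-word name is always a first name plus a single-word surname.

```ruby
require "set"

# Placeholder data: "baxter smith" is only the example from the text, not a vetted entry.
DOUBLE_BARRELLED = Set["baxter smith"].freeze

# words: the name split into words, with titles and suffixes already removed.
def surname(words, known = DOUBLE_BARRELLED)
  # With fewer than three words there is no room for a first name plus two surname words.
  return words.last.to_s if words.size < 3

  candidate = words.last(2).join(" ")
  # Default to the last word unless the last two words are a known surname.
  known.include?(candidate.downcase) ? candidate : words.last
end

puts surname(%w[John Baxter Smith])
puts surname(%w[John Quincy Adams])
```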

j_random_hacker answered Sep 30 '22 20:09


Look at the Lufthansa site. They ask customers which title they want to use. I have never seen a better idea than that.

I don't recommend using a gem or anything similar in this case, because gender conventions differ between English, Spanish, French and other languages; if you try to work out the title yourself, you won't succeed.

I hope this helps.

lucianosousa answered Sep 30 '22 19:09