
Can I use python 're' to parse complex human names?

So one of my major pain points is name comprehension and piecing together household names & titles. I have an 80% solution: a pretty massive regex I put together this morning that I probably shouldn't be proud of (but am anyway, in a kind of sick way). It matches the following examples correctly:

John Jeffries
John Jeffries, M.D.
John Jeffries, MD
John Jeffries and Jim Smith
John and Jim Jeffries
John Jeffries & Jennifer Wilkes-Smith, DDS, MD
John Jeffries, CPA & Jennifer Wilkes-Smith, DDS, MD
John Jeffries, C.P.A & Jennifer Wilkes-Smith, DDS, MD
John Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD
John Jeffries M.D. and Jennifer Holmes CPA
John Jeffries M.D. & Jennifer Holmes CPA

The regex matcher looks like this:

(?P<first_name>\S*\s*)?(?!and\s|&\s)(?P<last_name>[\w-]*\s*)(?P<titles1>,?\s*(?!and\s|&\s)[\w\.]*,*\s*(?!and\s|&\s)[\w\.]*)?(?P<connector>\sand\s|\s*&*\s*)?(?!and\s|&\s)(?P<first_name2>\S*\s*)(?P<last_name2>[\w-]*\s*)?(?P<titles2>,?\s*[\w\.]*,*\s*[\w\.]*)?

(wtf right?)

For convenience: http://www.pyregex.com/

So, for the example:

'John Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD'

the regex results in a group dict that looks like:

connector: &
first_name: John
first_name2: Jennifer
last_name: Jeffries
last_name2: Wilkes-Smith
titles1: , C.P.A., MD
titles2: , DDS, MD
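
For anyone who wants to reproduce that group dict locally instead of on pyregex, a minimal Python 3 sketch follows. The names `PATTERN`, `sample`, and `groups` are mine, not part of the original post, and the `.strip()` is an assumption added because several groups capture trailing whitespace (for example `first_name` comes back as `'John '`).

```python
import re

# The regex from the question, split across lines only for readability.
PATTERN = re.compile(
    r"(?P<first_name>\S*\s*)?(?!and\s|&\s)(?P<last_name>[\w-]*\s*)"
    r"(?P<titles1>,?\s*(?!and\s|&\s)[\w\.]*,*\s*(?!and\s|&\s)[\w\.]*)?"
    r"(?P<connector>\sand\s|\s*&*\s*)?(?!and\s|&\s)"
    r"(?P<first_name2>\S*\s*)(?P<last_name2>[\w-]*\s*)?"
    r"(?P<titles2>,?\s*[\w\.]*,*\s*[\w\.]*)?"
)

sample = "John Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD"
match = PATTERN.match(sample)

# Groups that did not participate come back as None, so fall back to "".
groups = {key: (value or "").strip() for key, value in match.groupdict().items()}

for key in sorted(groups):
    print(f"{key}: {groups[key]}")
```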

I need help with the final step that has been tripping me up: comprehending possible middle names.

Examples include:

'John Jimmy Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD'
'John Jeffries, C.P.A., MD & Jennifer Jenny Wilkes-Smith, DDS, MD'

Is this possible, and is there a better way to do this without machine learning? Maybe I could use nameparser (discovered after I went down the regex rabbit hole) instead, with some way to determine whether there are multiple names? The above matches 99.9% of my cases, so I feel like it's worth finishing.

TLDR: I can't figure out if I can use some sort of lookahead or lookbehind to make sure that the possible middle name only matches if there is a last name after it.
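
To make the lookahead idea concrete, here is a rough sketch for a single person only (no connector handling). It is not the regex above extended; `PERSON` and `split_person` are hypothetical names, and it assumes a title is made of capital letters and periods only (MD, M.D., C.P.A.), so an all-caps last name would be mistaken for a title. The optional middle group is accepted only when the lookahead sees another name-like word after it.

```python
import re

# Hypothetical single-person pattern (an illustration, not the question's regex).
# Assumption: titles consist only of capitals and periods.
PERSON = re.compile(
    r"^(?P<first>\w+)"
    # Middle name: allowed only if followed by whitespace and a word
    # that does not look like a title.
    r"(?:\s+(?P<middle>[\w-]+)(?=\s+(?![A-Z.]+\b)[\w-]+))?"
    r"\s+(?P<last>[\w-]+)"
)

def split_person(text):
    """Return (first, middle, last), or None when the text does not match."""
    m = PERSON.match(text)
    if m is None:
        return None
    return (m.group("first"), m.group("middle"), m.group("last"))

print(split_person("John Jimmy Jeffries, C.P.A., MD"))
print(split_person("John Jeffries, C.P.A., MD"))
print(split_person("John Jeffries MD"))
```

In the second and third calls the lookahead fails (a comma or a title follows the candidate middle name), so the optional group is skipped and `middle` is `None`.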

Note: I don't need to parse titles like Mr., Mrs., Ms., etc., but I suppose those could be added in the same manner as middle names.

Solution Notes: First, follow Richard's advice and don't do this. Second, investigate NLTK or use/contribute to nameparser for a more robust solution if necessary.

mzniko asked Feb 25 '15 20:02



1 Answer

Regular expressions like this are the work of the Dark One.

Who, looking at your code later, will be able to understand what is going on? Will you even?

How will you test all of the possible edge cases?

Why have you chosen to use a regular expression at all? If the tool you are using is so difficult to work with, it suggests that maybe another tool would be better.

Try this:

import re

examples = [
  "John Jeffries",
  "John Jeffries, M.D.",
  "John Jeffries, MD",
  "John Jeffries and Jim Smith",
  "John and Jim Jeffries",
  "John Jeffries & Jennifer Wilkes-Smith, DDS, MD",
  "John Jeffries, CPA & Jennifer Wilkes-Smith, DDS, MD",
  "John Jeffries, C.P.A & Jennifer Wilkes-Smith, DDS, MD",
  "John Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD",
  "John Jeffries M.D. and Jennifer Holmes CPA",
  "John Jeffries M.D. & Jennifer Holmes CPA",
  'John Jimmy Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD',
  'John Jeffries, C.P.A., MD & Jennifer Jenny Wilkes-Smith, DDS, MD'
]

def IsTitle(inp):
  #A title is one or more capital letters, each optionally followed by "."
  return re.match(r'^([A-Z]\.?)+$',inp.strip())

def ParseName(name):
  #Titles are separated from each other and from names with ","
  #We don't need these, so we remove them
  name = name.replace(',',' ') 
  #Split name and titles on spaces, combining adjacent spaces
  name = name.split()
  #Build an output object
  ret_name = {"first":None, "middle":None, "last":None, "titles":[]}
  #First string is always a first name
  ret_name['first'] = name[0]
  if len(name)>2: #John Johnson Smith or John Smith MD
    if IsTitle(name[2]): #John Smith MD (IsTitle only accepts capitals and ".", so not "PhD")
      ret_name['last']   = name[1]
      ret_name['titles'] = name[2:]
    else:                #John Johnson Smith, DDS, MD
      ret_name['middle'] = name[1]
      ret_name['last']   = name[2]
      ret_name['titles'] = name[3:]
  elif len(name) == 2:   #John Johnson
    ret_name['last'] = name[1]
  return ret_name

def CombineNames(inp):
  #"John and Jim Jeffries": the first person takes the second person's last name
  if len(inp)>1 and not inp[0]['last']:
    inp[0]['last'] = inp[1]['last']

def ParseString(inp):
  inp = inp.replace("&","and")     #Names are combined with "&" or "and"
  inp = re.split(r"\s+and\s+",inp) #Split names apart
  inp = list(map(ParseName,inp))   #list() so the result can be indexed on Python 3
  CombineNames(inp)
  return inp

for e in examples:
  print(e)
  print(ParseString(e))

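Output (key order as printed by Python 2; Python 3.7+ prints each dict's keys in insertion order, first through titles):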

John Jeffries
[{'middle': None, 'titles': [], 'last': 'Jeffries', 'first': 'John'}]
John Jeffries, M.D.
[{'middle': None, 'titles': ['M.D.'], 'last': 'Jeffries', 'first': 'John'}]
John Jeffries, MD
[{'middle': None, 'titles': ['MD'], 'last': 'Jeffries', 'first': 'John'}]
John Jeffries and Jim Smith
[{'middle': None, 'titles': [], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': [], 'last': 'Smith', 'first': 'Jim'}]
John and Jim Jeffries
[{'middle': None, 'titles': [], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': [], 'last': 'Jeffries', 'first': 'Jim'}]
John Jeffries & Jennifer Wilkes-Smith, DDS, MD
[{'middle': None, 'titles': [], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]
John Jeffries, CPA & Jennifer Wilkes-Smith, DDS, MD
[{'middle': None, 'titles': ['CPA'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]
John Jeffries, C.P.A & Jennifer Wilkes-Smith, DDS, MD
[{'middle': None, 'titles': ['C.P.A'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]
John Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD
[{'middle': None, 'titles': ['C.P.A.', 'MD'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]
John Jeffries M.D. and Jennifer Holmes CPA
[{'middle': None, 'titles': ['M.D.'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['CPA'], 'last': 'Holmes', 'first': 'Jennifer'}]
John Jeffries M.D. & Jennifer Holmes CPA
[{'middle': None, 'titles': ['M.D.'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['CPA'], 'last': 'Holmes', 'first': 'Jennifer'}]
John Jimmy Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD
[{'middle': 'Jimmy', 'titles': ['C.P.A.', 'MD'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]
John Jeffries, C.P.A., MD & Jennifer Jenny Wilkes-Smith, DDS, MD
[{'middle': None, 'titles': ['C.P.A.', 'MD'], 'last': 'Jeffries', 'first': 'John'}, {'middle': 'Jenny', 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]

This took less than fifteen minutes and, at each stage, the logic is clear and the program can be debugged in pieces. While one-liners are cute, clarity and testability should take precedence.
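
To illustrate the testability point, here is a compact Python 3 sketch of the same logic with a table of cases run through plain `assert`s. The snake_case names, the `CASES` table, and `run_cases` are my own for this illustration. It also splits on "and" or "&" in one step rather than replacing "&" first, which should behave the same for the inputs above.

```python
import re

def is_title(token):
    # Same heuristic as above: capitals, each optionally followed by a period.
    return re.fullmatch(r"(?:[A-Z]\.?)+", token) is not None

def parse_name(text):
    tokens = text.replace(",", " ").split()
    person = {"first": tokens[0], "middle": None, "last": None, "titles": []}
    if len(tokens) > 2:
        if is_title(tokens[2]):          # John Smith MD
            person["last"], person["titles"] = tokens[1], tokens[2:]
        else:                            # John Johnson Smith, DDS, MD
            person["middle"], person["last"] = tokens[1], tokens[2]
            person["titles"] = tokens[3:]
    elif len(tokens) == 2:               # John Johnson
        person["last"] = tokens[1]
    return person

def parse_string(text):
    people = [parse_name(part) for part in re.split(r"\s+(?:and|&)\s+", text)]
    # "John and Jim Jeffries": the first person borrows the shared last name.
    if len(people) > 1 and not people[0]["last"]:
        people[0]["last"] = people[1]["last"]
    return people

# Each case: input string -> list of (first, middle, last, titles) tuples.
CASES = [
    ("John Jeffries", [("John", None, "Jeffries", [])]),
    ("John and Jim Jeffries",
     [("John", None, "Jeffries", []), ("Jim", None, "Jeffries", [])]),
    ("John Jeffries M.D. & Jennifer Holmes CPA",
     [("John", None, "Jeffries", ["M.D."]), ("Jennifer", None, "Holmes", ["CPA"])]),
    ("John Jimmy Jeffries, C.P.A., MD",
     [("John", "Jimmy", "Jeffries", ["C.P.A.", "MD"])]),
]

def run_cases():
    for text, expected in CASES:
        got = [(p["first"], p["middle"], p["last"], p["titles"])
               for p in parse_string(text)]
        assert got == expected, (text, got)

run_cases()
```

Because each function does one small job, a failing case points at one piece (`is_title`, `parse_name`, or the last-name sharing in `parse_string`) rather than at one long pattern.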

Richard answered Oct 28 '22 11:10