So one of my major pain points is name comprehension and piecing together household names & titles. I have an 80% solution: a pretty massive regex I put together this morning that I probably shouldn't be proud of (but am anyway, in a kind of sick way). It matches the following examples correctly:
John Jeffries
John Jeffries, M.D.
John Jeffries, MD
John Jeffries and Jim Smith
John and Jim Jeffries
John Jeffries & Jennifer Wilkes-Smith, DDS, MD
John Jeffries, CPA & Jennifer Wilkes-Smith, DDS, MD
John Jeffries, C.P.A & Jennifer Wilkes-Smith, DDS, MD
John Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD
John Jeffries M.D. and Jennifer Holmes CPA
John Jeffries M.D. & Jennifer Holmes CPA
The regex matcher looks like this:
(?P<first_name>\S*\s*)?(?!and\s|&\s)(?P<last_name>[\w-]*\s*)(?P<titles1>,?\s*(?!and\s|&\s)[\w\.]*,*\s*(?!and\s|&\s)[\w\.]*)?(?P<connector>\sand\s|\s*&*\s*)?(?!and\s|&\s)(?P<first_name2>\S*\s*)(?P<last_name2>[\w-]*\s*)?(?P<titles2>,?\s*[\w\.]*,*\s*[\w\.]*)?
(wtf right?)
For convenience: http://www.pyregex.com/
So, for the example:
'John Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD'
the regex results in a group dict that looks like:
connector: &
first_name: John
first_name2: Jennifer
last_name: Jeffries
last_name2: Wilkes-Smith
titles1: , C.P.A., MD
titles2: , DDS, MD
I need help with the final step, which has been tripping me up: handling possible middle names.
Examples include:
'John Jimmy Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD'
'John Jeffries, C.P.A., MD & Jennifer Jenny Wilkes-Smith, DDS, MD'
Is this possible, and is there a better way to do this without machine learning? Maybe I can use nameparser (discovered after I went down the regex rabbit hole) instead, together with some way to determine whether or not there are multiple names? The above matches 99.9% of my cases, so I feel like it's worth finishing.
TLDR: I can't figure out if I can use some sort of lookahead or lookbehind to make sure that the possible middle name only matches if there is a last name after it.
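For what it's worth, here is a minimal sketch of the lookahead idea, assuming the string has already been split into one person per piece. The names TITLE, NAME, PERSON and parse_person are made up for illustration and are not part of the regex above. The middle name sits in an optional group that is only consumed when another name word follows it. Ordinary backtracking would likely give the same result without the lookahead, since a last name is required anyway; the lookahead just makes the condition explicit.

```python
import re

# Hypothetical token definitions, not taken from the original regex.
# A title is a run of capitals with optional dots: MD, M.D., C.P.A
TITLE = r"(?:[A-Z]\.?)+"
# A name word is any word that is not a title.
NAME = rf"(?!{TITLE}(?:[\s,]|$))[A-Za-z][A-Za-z'-]*"

PERSON = re.compile(
    rf"^(?P<first>{NAME})"
    # Optional middle name, consumed only if another name word follows it.
    rf"(?:\s+(?P<middle>{NAME})(?=\s+{NAME}))?"
    rf"\s+(?P<last>{NAME})"
    # Zero or more titles, separated by commas and/or whitespace.
    rf"(?P<titles>(?:(?:,\s*|\s+){TITLE})*)$"
)


def parse_person(text):
    """Return first/middle/last/titles for one person, or None if no match."""
    m = PERSON.match(text.strip())
    if m is None:
        return None
    parts = m.groupdict()
    # Turn ", C.P.A., MD" into ["C.P.A.", "MD"].
    parts["titles"] = parts["titles"].replace(",", " ").split()
    return parts


for text in ("John Jimmy Jeffries, C.P.A., MD", "John Jeffries MD"):
    print(parse_person(text))
```

In the second string "MD" is rejected as a name word, so the middle group is skipped and "Jeffries" ends up as the last name. A bare initial or an all-caps surname would be mistaken for a title under this rule.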
Note: I don't need to parse titles like Mr. Mrs. Ms., etc., but I suppose that can be added in the same manner as middle names.
Solution Notes: First, follow Richard's advice and don't do this. Second, investigate NLTK or use/contribute to nameparser for a more robust solution if necessary.
Regular expressions like this are the work of the Dark One.
Who, looking at your code later, will be able to understand what is going on? Will you even?
How will you test all of the possible edge cases?
Why have you chosen to use a regular expression at all? If the tool you are using is so difficult to work with, it suggests that maybe another tool would be better.
Try this:
import re

examples = [
    "John Jeffries",
    "John Jeffries, M.D.",
    "John Jeffries, MD",
    "John Jeffries and Jim Smith",
    "John and Jim Jeffries",
    "John Jeffries & Jennifer Wilkes-Smith, DDS, MD",
    "John Jeffries, CPA & Jennifer Wilkes-Smith, DDS, MD",
    "John Jeffries, C.P.A & Jennifer Wilkes-Smith, DDS, MD",
    "John Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD",
    "John Jeffries M.D. and Jennifer Holmes CPA",
    "John Jeffries M.D. & Jennifer Holmes CPA",
    'John Jimmy Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD',
    'John Jeffries, C.P.A., MD & Jennifer Jenny Wilkes-Smith, DDS, MD'
]

def IsTitle(inp):
    #A title is one or more capital letters, each optionally followed by "."
    return re.match(r'^([A-Z]\.?)+$', inp.strip())

def ParseName(name):
    #Titles are separated from each other and from names with ","
    #We don't need these, so we remove them
    name = name.replace(',', ' ')
    #Split name and titles on spaces, combining adjacent spaces
    name = name.split()
    #Build an output object
    ret_name = {"first": None, "middle": None, "last": None, "titles": []}
    #First string is always a first name
    ret_name['first'] = name[0]
    if len(name) > 2: #John Johnson Smith/PhD
        if IsTitle(name[2]): #John Smith PhD
            ret_name['last'] = name[1]
            ret_name['titles'] = name[2:]
        else: #John Johnson Smith, PhD, MD
            ret_name['middle'] = name[1]
            ret_name['last'] = name[2]
            ret_name['titles'] = name[3:]
    elif len(name) == 2: #John Johnson
        ret_name['last'] = name[1]
    return ret_name

def CombineNames(inp):
    #"John and Jim Jeffries": the first person borrows the second's last name
    if len(inp) > 1 and not inp[0]['last']:
        inp[0]['last'] = inp[1]['last']

def ParseString(inp):
    inp = inp.replace("&", "and") #Names are combined with "&" or "and"
    inp = re.split(r"\s+and\s+", inp) #Split names apart
    inp = list(map(ParseName, inp)) #list() so the result can be indexed on Python 3
    CombineNames(inp)
    return inp

#print(...) with a single argument works on both Python 2 and 3;
#the key order of the printed dicts depends on the Python version
for e in examples:
    print(e)
    print(ParseString(e))
Output:
John Jeffries
[{'middle': None, 'titles': [], 'last': 'Jeffries', 'first': 'John'}]
John Jeffries, M.D.
[{'middle': None, 'titles': ['M.D.'], 'last': 'Jeffries', 'first': 'John'}]
John Jeffries, MD
[{'middle': None, 'titles': ['MD'], 'last': 'Jeffries', 'first': 'John'}]
John Jeffries and Jim Smith
[{'middle': None, 'titles': [], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': [], 'last': 'Smith', 'first': 'Jim'}]
John and Jim Jeffries
[{'middle': None, 'titles': [], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': [], 'last': 'Jeffries', 'first': 'Jim'}]
John Jeffries & Jennifer Wilkes-Smith, DDS, MD
[{'middle': None, 'titles': [], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]
John Jeffries, CPA & Jennifer Wilkes-Smith, DDS, MD
[{'middle': None, 'titles': ['CPA'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]
John Jeffries, C.P.A & Jennifer Wilkes-Smith, DDS, MD
[{'middle': None, 'titles': ['C.P.A'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]
John Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD
[{'middle': None, 'titles': ['C.P.A.', 'MD'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]
John Jeffries M.D. and Jennifer Holmes CPA
[{'middle': None, 'titles': ['M.D.'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['CPA'], 'last': 'Holmes', 'first': 'Jennifer'}]
John Jeffries M.D. & Jennifer Holmes CPA
[{'middle': None, 'titles': ['M.D.'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['CPA'], 'last': 'Holmes', 'first': 'Jennifer'}]
John Jimmy Jeffries, C.P.A., MD & Jennifer Wilkes-Smith, DDS, MD
[{'middle': 'Jimmy', 'titles': ['C.P.A.', 'MD'], 'last': 'Jeffries', 'first': 'John'}, {'middle': None, 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]
John Jeffries, C.P.A., MD & Jennifer Jenny Wilkes-Smith, DDS, MD
[{'middle': None, 'titles': ['C.P.A.', 'MD'], 'last': 'Jeffries', 'first': 'John'}, {'middle': 'Jenny', 'titles': ['DDS', 'MD'], 'last': 'Wilkes-Smith', 'first': 'Jennifer'}]
This took less than fifteen minutes and, at each stage, the logic is clear and the program can be debugged in pieces. While one-liners are cute, clarity and testability should take precedence.
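As a rough illustration of testing in pieces, the title check can be exercised on its own with a small table of cases. The helper is_title below is a snake_case copy of the IsTitle rule above that returns a plain boolean, and the case table is made up for this sketch.

```python
import re


def is_title(token):
    """Same rule as IsTitle: capitals, each optionally followed by a dot."""
    return re.match(r"^([A-Z]\.?)+$", token.strip()) is not None


# Hypothetical test table: token -> whether it should count as a title.
cases = {
    "MD": True,
    "M.D.": True,
    "C.P.A": True,
    "Smith": False,
    "Wilkes-Smith": False,
    "J": True,  # a bare initial is indistinguishable from a title under this rule
}

for token, expected in cases.items():
    assert is_title(token) == expected, token
```

The last case is the kind of edge a table like this surfaces quickly: a middle initial or an all-caps surname would be classified as a title, which may or may not matter for your data.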