I want to split names of individuals into multiple strings. I am able to extract the first name and last name quite easily, but I have problems extracting the middle name or names as these are quite different in each scenario.
The data would look like this:
ID| Complete_Name | Type
1 | JERRY, Ben | "I"
2 | VON HELSINKI, Olga | "I"
3 | JENSEN, James Goodboy Dean | "I"
4 | THE COMPANY | "C"
5 | CRUZ, Juan S. de la | "I"
Whereby there are names with only a first and last name and names with something in between or two middle names. How can I extract the middle names from a Pandas dataframe? I can already extract the first and last names.
df = pd.read_csv("list.pip", sep="|")
df["First Name"] =
np.where(df["Type"]=="I",df['Complete_Name'].str.split(',').str.get(1) , df[""])
df["Last Name"] = np.where(df["Type"]=="I",df['Complete_Name'].str.split(' ').str.get(1) , df[""])
The desired results should look like this:
ID| Complete_Name | Type | First Name | Middle Name | Last Name
1 | JERRY, Ben | "I" | Ben | | JERRY
2 | VON HELSINKI, Olga | "I" | Olga | |
3 | JENSEN, James Goodboy Dean | "I" | James | Goodboy Dean| VON HELSINKI
4 | THE COMPANY | "C" | | |
5 | CRUZ, Juan S. de la | "I" | Juan | S. de la | CRUZ
A single str.extract
call will work here:
p = r'^(?P<Last_Name>.*), (?P<First_Name>\S+)\b\s*(?P<Middle_Name>.*)'
u = df.loc[df.Type == "I", 'Complete_Name'].str.extract(p)
pd.concat([df, u], axis=1).fillna('')
ID Complete_Name Type Last_Name First_Name Middle_Name
0 1 JERRY, Ben I JERRY Ben
1 2 VON HELSINKI, Olga I VON HELSINKI Olga
2 3 JENSEN, James Goodboy Dean I JENSEN James Goodboy Dean
3 4 THE COMPANY C
4 5 CRUZ, Juan S. de la I CRUZ Juan S. de la
Regex Breakdown
^ # Start-of-line
(?P<Last_Name> # First named capture group - Last Name
.* # Match anything until...
)
, # ...we see a comma
\s # whitespace
(?P<First_Name> # Second capture group - First Name
\S+ # Match all non-whitespace characters
)
\b # Word boundary
\s* # Optional whitespace chars (mostly housekeeping)
(?P<Middle_Name> # Third capture group - Zero of more middle names
.* # Match everything till the end of string
)
I think you can do:
# take the complete_name column and split it multiple times
df2 = (df.loc[df['Type'].eq('I'),'Complete_Name'].str
.split(',', expand=True)
.fillna(''))
# remove extra spaces
for x in df2.columns:
df2[x] = [x.strip() for x in df2[x]]
# split the name on first space and join it
df2 = pd.concat([df2[0],df2[1].str.split(' ',1, expand=True)], axis=1)
df2.columns = ['last','first','middle']
# join the data frames
df = pd.concat([df[['ID','Complete_Name']], df2], axis=1)
# rearrange columns - not necessary though
df = df[['ID','Complete_Name','first','middle','last']]
# remove none values
df = df.replace([None], '')
ID Complete_Name Type first middle last
0 1 JERRY, Ben I Ben JERRY
1 2 VON HELSINKI, Olga I Olga VON HELSINKI
2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
3 4 THE COMPANY C
4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With