
How to split a pandas string to extract middle names?

I want to split names of individuals into multiple strings. I am able to extract the first name and last name quite easily, but I have problems extracting the middle name or names as these are quite different in each scenario.

The data would look like this:

ID| Complete_Name               | Type
1 | JERRY, Ben                  | "I"
2 | VON HELSINKI, Olga          | "I"
3 | JENSEN, James Goodboy Dean  | "I"
4 | THE COMPANY                 | "C"
5 | CRUZ, Juan S. de la         | "I"

Some names have only a first and a last name, while others have one or more middle names in between. How can I extract the middle names from a pandas DataFrame? I can already extract the first and last names.

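import numpy as np
import pandas as pd
df = pd.read_csv("list.pip", sep="|")
is_individual = df["Type"] == "I"
# last name: everything before the comma
df["Last Name"] = np.where(is_individual, df["Complete_Name"].str.split(",").str.get(0), "")
# first name: first word after the comma
df["First Name"] = np.where(is_individual, df["Complete_Name"].str.split(",").str.get(1).str.split().str.get(0), "")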

The desired results should look like this:

ID| Complete_Name               | Type | First Name | Middle Name | Last Name
1 | JERRY, Ben                  | "I"  | Ben        |             | JERRY
2 | VON HELSINKI, Olga          | "I"  | Olga       |             | VON HELSINKI
3 | JENSEN, James Goodboy Dean  | "I"  | James      | Goodboy Dean| JENSEN
4 | THE COMPANY                 | "C"  |            |             |
5 | CRUZ, Juan S. de la         | "I"  | Juan       | S. de la    | CRUZ
mrPy asked Dec 23 '22 02:12


2 Answers

A single str.extract call will work here:

p = r'^(?P<Last_Name>.*), (?P<First_Name>\S+)\b\s*(?P<Middle_Name>.*)' 
u = df.loc[df.Type == "I", 'Complete_Name'].str.extract(p)
pd.concat([df, u], axis=1).fillna('')

   ID               Complete_Name Type     Last_Name First_Name   Middle_Name
0   1                  JERRY, Ben    I         JERRY        Ben              
1   2          VON HELSINKI, Olga    I  VON HELSINKI       Olga              
2   3  JENSEN, James Goodboy Dean    I        JENSEN      James  Goodboy Dean
3   4                 THE COMPANY    C                                       
4   5         CRUZ, Juan S. de la    I          CRUZ       Juan      S. de la

Regex Breakdown

^                # Start of string
(?P<Last_Name>   # First named capture group - Last Name
    .*           # Match anything until...
)
,                # ...we see a comma
\s               # Whitespace
(?P<First_Name>  # Second named capture group - First Name
    \S+          # Match one or more non-whitespace characters
)
\b               # Word boundary
\s*              # Optional whitespace chars (mostly housekeeping)
(?P<Middle_Name> # Third named capture group - Zero or more middle names
    .*           # Match everything till the end of string
)
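For anyone who wants to run this end to end, here is a minimal sketch that keeps the breakdown inside the pattern itself by passing `re.VERBOSE` to `str.extract`. The DataFrame is typed in by hand from the question's table (an assumption; the real data comes from a file), and the names `pattern`, `mask`, `parts` and `result` are only illustrative.

```python
import re
import pandas as pd

# Sample data copied by hand from the question's table
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Complete_Name": [
        "JERRY, Ben",
        "VON HELSINKI, Olga",
        "JENSEN, James Goodboy Dean",
        "THE COMPANY",
        "CRUZ, Juan S. de la",
    ],
    "Type": ["I", "I", "I", "C", "I"],
})

# Same pattern as above, in verbose form: whitespace and comments are ignored
pattern = r"""
    ^(?P<Last_Name>.*)      # last name: everything before the comma
    ,\s                     # the comma and one whitespace character
    (?P<First_Name>\S+)     # first name: first run of non-whitespace
    \b\s*                   # word boundary, then optional whitespace
    (?P<Middle_Name>.*)     # middle names: the rest, possibly empty
"""

# Only rows of type "I" are parsed; other rows get empty strings after fillna
mask = df["Type"] == "I"
parts = df.loc[mask, "Complete_Name"].str.extract(pattern, flags=re.VERBOSE)
result = pd.concat([df, parts], axis=1).fillna("")
print(result)
```

Because `pd.concat(..., axis=1)` aligns on the index, the company row simply has no match in `parts` and is filled with empty strings.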
cs95 answered Jan 05 '23 01:01


I think you can do:

# take the Complete_Name column for individuals and split it on the comma
df2 = (df.loc[df['Type'].eq('I'), 'Complete_Name'].str
       .split(',', expand=True))

# remove extra spaces
for col in df2.columns:
    df2[col] = df2[col].str.strip()

# split the rest of the name on the first space and join it with the last name
df2 = pd.concat([df2[0], df2[1].str.split(' ', n=1, expand=True)], axis=1)
df2.columns = ['last', 'first', 'middle']

# join the data frames
df = pd.concat([df[['ID', 'Complete_Name', 'Type']], df2], axis=1)

# rearrange columns - not necessary though
df = df[['ID', 'Complete_Name', 'Type', 'first', 'middle', 'last']]

# replace missing values with empty strings
df = df.fillna('')

   ID                  Complete_Name Type  first        middle          last
0   1   JERRY, Ben                      I    Ben                       JERRY
1   2   VON HELSINKI, Olga              I   Olga                VON HELSINKI
2   3   JENSEN, James Goodboy Dean      I  James  Goodboy Dean        JENSEN
3   4   THE COMPANY                     C                                   
4   5   CRUZ, Juan S. de la             I   Juan      S. de la          CRUZ
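One caveat with splitting on the first space with `expand=True`: if no row in the data happens to have a middle name, the split returns a single column and assigning three column names would fail. A possible variant, offered only as a sketch, uses `str.partition`, which always returns three columns (before, separator, after). The DataFrame below is typed in by hand from the question, and `last_rest`, `first_middle`, `parts` and `out` are illustrative names.

```python
import pandas as pd

# Sample data copied by hand from the question's table
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Complete_Name": [
        "JERRY, Ben",
        "VON HELSINKI, Olga",
        "JENSEN, James Goodboy Dean",
        "THE COMPANY",
        "CRUZ, Juan S. de la",
    ],
    "Type": ["I", "I", "I", "C", "I"],
})

# Only individuals are parsed
names = df.loc[df["Type"].eq("I"), "Complete_Name"]

# Column 0 is the last name, column 2 is everything after the comma
last_rest = names.str.partition(",")

# Column 0 is the first name, column 2 is the (possibly empty) remainder
first_middle = last_rest[2].str.strip().str.partition(" ")

parts = pd.DataFrame({
    "first": first_middle[0],
    "middle": first_middle[2].str.strip(),
    "last": last_rest[0].str.strip(),
})

# join aligns on the index; the company row has no match and is filled with ''
out = df.join(parts).fillna("")
print(out)
```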
YOLO answered Jan 05 '23 00:01