I want to extract a substring (titles such as Mr., Mrs., Miss, etc.) from a column (Name) in a pandas dataframe and then write it back into the dataframe as a new column (Title).
In the Name column of the dataframe I have names such as "Braund, Mr. Owen Harris". The two delimiters are the comma and the period.
I have attempted to use a split method, but this only splits the original string in two within a list, so I still end up with ['Braund', ' Mr. Owen Harris'] in the list.
import pandas as pd
#import re
df_Train = pd.read_csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vTliZmavBsJCFDiEwxcSIIftu-0gR9p34n8Bq4OUNL4TxwHY-JMS6KhZEbWr1bp91UqHPkliZBBFgwh/pub?gid=1593012114&single=true&output=csv')
a = df_Train['Name'].str.split(',')
for i in a:
    print(i[1])
I am thinking this might be a situation where regex comes into play. My reading suggests that a lookahead (?=,) and a lookbehind (?<='.') should do the trick. For example:
import re
a= df_Train['Name'].str.split(r'(?=,)*(?<='.'))
for i in a:
    print(i)
    print(i[1])
But I am running into an error (EOL while scanning string literal). Can someone point me in the right direction?
Cheers Mike
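You can do it by chaining the .str accessor: split on the comma, take the second piece, split on the period, take the first piece, and strip the surrounding whitespace.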
df_Train.Name.str.split(',').str[1].str.split('.').str[0].str.strip()
Output head(5):
0 Mr
1 Mrs
2 Miss
3 Mrs
4 Mr
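The question also asks how to write the result back as a new Title column, which the expression above does not do by itself. A minimal sketch, assuming a small hand-made dataframe (df_sample is a hypothetical stand-in for df_Train, with rows in the same "Surname, Title. Given names" format):

```python
import pandas as pd

# Hypothetical sample rows in the same format as the Name column
df_sample = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
    ]
})

# Same chain as above, assigned to a new column
df_sample["Title"] = (
    df_sample["Name"]
    .str.split(",").str[1]   # text after the comma
    .str.split(".").str[0]   # text before the first period
    .str.strip()             # drop the leading space
)

print(df_sample["Title"].tolist())  # ['Mr', 'Mrs', 'Miss']
```

The same assignment should work on df_Train by replacing df_sample with df_Train.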
Counts for each title:
df_Train.Name.str.split(',').str[1].str.split('.').str[0].str.strip()\
.value_counts()
Output
Mr 517
Miss 182
Mrs 125
Master 40
Dr 7
Rev 6
Mlle 2
Col 2
Major 2
Lady 1
Mme 1
Sir 1
Ms 1
the Countess 1
Jonkheer 1
Don 1
Capt 1
Name: Name, dtype: int64
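For completeness, a regex route is also possible. The attempt in the question fails before the regex is even evaluated: the single quotes around the period inside r'...' end the string literal early, which is what produces the EOL error. A sketch of one alternative using str.extract with a capture group instead of lookarounds (the sample names and the pattern are illustrative assumptions, not taken from the answer above):

```python
import pandas as pd

# Hypothetical sample values in the same format as the Name column
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# ,\s*     the comma and any whitespace after it
# ([^.]+)  capture everything up to (not including) the first period
# \.       the period that ends the title
titles = names.str.extract(r",\s*([^.]+)\.", expand=False)

print(titles.tolist())  # ['Mr', 'Miss', 'the Countess']
```

With expand=False and a single capture group, str.extract returns a Series, so it can be assigned directly to a new column.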