I have a dataset of administrative filings that include short biographies. I am trying to extract people's ages by using python and some pattern matching. Some example of sentences are:
These are some of the patterns I have identified in the dataset. I want to add that there are other patterns, but I have not run into them yet, and not sure how I could get to that. I wrote the following code that works pretty well, but is pretty inefficient so will take too much time to run on the whole dataset.
#Create a search list of expressions that might come right before an age instance
age_search_list = [" " + last_name.lower().strip() + ", age ",
" " + clean_sec_last_name.lower().strip() + " age ",
last_name.lower().strip() + " age ",
full_name.lower().strip() + ", age ",
full_name.lower().strip() + ", ",
" " + last_name.lower() + ", ",
" " + last_name.lower().strip() + " \(",
" " + last_name.lower().strip() + " is "]
#for each element in our search list
for element in age_search_list:
print("Searching: ",element)
# retrieve all the instances where we might have an age
for age_biography_instance in re.finditer(element,souptext.lower()):
#extract the next four characters
age_biography_start = int(age_biography_instance.start())
age_instance_start = age_biography_start + len(element)
age_instance_end = age_instance_start + 4
age_string = souptext[age_instance_start:age_instance_end]
#extract what should be the age
potential_age = age_string[:-2]
#extract the next two characters as a security check (i.e. age should be followed by comma, or dot, etc.)
age_security_check = age_string[-2:]
age_security_check_list = [", ",". ",") "," y"]
if age_security_check in age_security_check_list:
print("Potential age instance found for ",full_name,": ",potential_age)
#check that what we extracted is an age, convert it to birth year
try:
potential_age = int(potential_age)
print("Potential age detected: ",potential_age)
if 18 < int(potential_age) < 100:
sec_birth_year = int(filing_year) - int(potential_age)
print("Filing year was: ",filing_year)
print("Estimated birth year for ",clean_sec_full_name,": ",sec_birth_year)
#Now, we save it in the main dataframe
new_sec_parser = pd.DataFrame([[clean_sec_full_name,"0","0",sec_birth_year,""]],columns = ['Name','Male','Female','Birth','Suffix'])
df_sec_parser = pd.concat([df_sec_parser,new_sec_parser])
except ValueError:
print("Problem with extracted age ",potential_age)
I have a few questions:
Some sentences extracted from the dataset:
Since your text has to be processed, and not only pattern matched, the correct approach is to use one of the many NLP tools available out there.
Your aim is to use Named Entity Recognition (NER) which is usually done based on Machine Learning Models. The NER activity attempts to recognize a determined set of Entity Types in text. Examples are: Locations, Dates, Organizations and Person names.
While not 100% precise, this is much more precise than simple pattern matching (especially for english), since it relies on other information other than Patterns, such as Part of Speech (POS), Dependency Parsing, etc.
Take a look on the results I obtained for the phrases you provided by using Allen NLP Online Tool (using fine-grained-NER model):
Notice that this last one is wrong. As I said, not 100%, but easy to use.
The big advantage of this approach: you don't have to make a special pattern for every one of the millions of possibilities available.
The best thing: you can integrate it into your Python code:
pip install allennlp
And:
from allennlp.predictors import Predictor
al = Predictor.from_path("https://s3-us-west-2.amazonaws.com/allennlp/models/fine-
grained-ner-model-elmo-2018.12.21.tar.gz")
al.predict("Your sentence with date here")
Then, look at the resulting dict for "Date" Entities.
Same thing goes for Spacy:
!python3 -m spacy download en_core_web_lg
import spacy
sp_lg = spacy.load('en_core_web_lg')
{(ent.text.strip(), ent.label_) for ent in sp_lg("Your sentence with date here").ents}
(However, I had some bad experiences with bad predictions there - although it is considered better).
For more info, read this interesting article at Medium: https://medium.com/@b.terryjack/nlp-pretrained-named-entity-recognition-7caa5cd28d7b
import re
x =["Mr Bond, 67, is an engineer in the UK"
,"Amanda B. Bynes, 34, is an actress"
,"Peter Parker (45) will be our next administrator"
,"Mr. Dylan is 46 years old."
,"Steve Jones, Age:32,"]
[re.findall(r'\d{1,3}', i)[0] for i in x] # ['67', '34', '45', '46', '32']
This will work for all the cases you provided: https://repl.it/repls/NotableAncientBackground
import re
input =["Mr Bond, 67, is an engineer in the UK"
,"Amanda B. Bynes, 34, is an actress"
,"Peter Parker (45) will be our next administrator"
,"Mr. Dylan is 46 years old."
,"Steve Jones, Age:32,", "Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation",
"George F. Rubin(14)(15) Age 68 Trustee since: 1997.",
"INDRA K. NOOYI, 56, has been PepsiCos Chief Executive Officer (CEO) since 2006",
"Mr. Lovallo, 47, was appointed Treasurer in 2011.",
"Mr. Charles Baker, 79, is a business advisor to biotechnology companies.",
"Mr. Botein, age 43, has been a member of our Board since our formation."]
for i in input:
age = re.findall(r'Age[\:\s](\d{1,3})', i)
age.extend(re.findall(r' (\d{1,3}),? ', i))
if len(age) == 0:
age = re.findall(r'\((\d{1,3})\)', i)
print(i+ " --- AGE: "+ str(set(age)))
Returns
Mr Bond, 67, is an engineer in the UK --- AGE: {'67'}
Amanda B. Bynes, 34, is an actress --- AGE: {'34'}
Peter Parker (45) will be our next administrator --- AGE: {'45'}
Mr. Dylan is 46 years old. --- AGE: {'46'}
Steve Jones, Age:32, --- AGE: {'32'}
Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation --- AGE: set()
George F. Rubin(14)(15) Age 68 Trustee since: 1997. --- AGE: {'68'}
INDRA K. NOOYI, 56, has been PepsiCos Chief Executive Officer (CEO) since 2006 --- AGE: {'56'}
Mr. Lovallo, 47, was appointed Treasurer in 2011. --- AGE: {'47'}
Mr. Charles Baker, 79, is a business advisor to biotechnology companies. --- AGE: {'79'}
Mr. Botein, age 43, has been a member of our Board since our formation. --- AGE: {'43'}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With