
Extracting a person's age from unstructured text in Python

I have a dataset of administrative filings that include short biographies. I am trying to extract people's ages using Python and some pattern matching. Some example sentences are:

  • "Mr Bond, 67, is an engineer in the UK"
  • "Amanda B. Bynes, 34, is an actress"
  • "Peter Parker (45) will be our next administrator"
  • "Mr. Dylan is 46 years old."
  • "Steve Jones, Age: 32,"

These are some of the patterns I have identified in the dataset. There are other patterns that I have not run into yet, and I am not sure how to find them. I wrote the following code, which works pretty well but is inefficient, so it would take too much time to run on the whole dataset.

import re
import pandas as pd

#Create a search list of expressions that might come right before an age instance
age_search_list = [" " + last_name.lower().strip() + ", age ",
" " + clean_sec_last_name.lower().strip() + " age ",
last_name.lower().strip() + " age ",
full_name.lower().strip() + ", age ",
full_name.lower().strip() + ", ",
" " + last_name.lower() + ", ",
" " + last_name.lower().strip()  + r" \(",  # raw string so "\(" is a literal parenthesis
" " + last_name.lower().strip()  + " is "]

#for each element in our search list
for element in age_search_list:
    print("Searching: ",element)

    # retrieve all the instances where we might have an age
    for age_biography_instance in re.finditer(element,souptext.lower()):

        #extract the next four characters
        # use the end of the match rather than start + len(element): an escaped
        # character such as "\(" makes the pattern longer than the text it matches
        age_instance_start = age_biography_instance.end()
        age_instance_end = age_instance_start + 4
        age_string = souptext[age_instance_start:age_instance_end]

        #extract what should be the age
        potential_age = age_string[:-2]

        #extract the next two characters as a security check (i.e. age should be followed by comma, or dot, etc.)
        age_security_check = age_string[-2:]
        age_security_check_list = [", ",". ",") "," y"]

        if age_security_check in age_security_check_list:
            print("Potential age instance found for ",full_name,": ",potential_age)

            #check that what we extracted is an age, convert it to birth year
            try:
                potential_age = int(potential_age)
                print("Potential age detected: ",potential_age)
                if 18 < int(potential_age) < 100:
                    sec_birth_year = int(filing_year) - int(potential_age)
                    print("Filing year was: ",filing_year)
                    print("Estimated birth year for ",clean_sec_full_name,": ",sec_birth_year)
                    #Now, we save it in the main dataframe
                    new_sec_parser = pd.DataFrame([[clean_sec_full_name,"0","0",sec_birth_year,""]],columns = ['Name','Male','Female','Birth','Suffix'])
                    df_sec_parser = pd.concat([df_sec_parser,new_sec_parser])

            except ValueError:
                print("Problem with extracted age ",potential_age)

I have a few questions:

  • Is there a more efficient way to extract this information?
  • Should I use a regex instead?
  • My text documents are very long and I have lots of them. Can I do one search for all the items at once?
  • What would be a strategy to detect other patterns in the dataset?

Some sentences extracted from the dataset:

  • "Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation"
  • "George F. Rubin(14)(15) Age 68 Trustee since: 1997."
  • "INDRA K. NOOYI, 56, has been PepsiCos Chief Executive Officer (CEO) since 2006"
  • "Mr. Lovallo, 47, was appointed Treasurer in 2011."
  • "Mr. Charles Baker, 79, is a business advisor to biotechnology companies."
  • "Mr. Botein, age 43, has been a member of our Board since our formation."
asked Aug 07 '19 13:08 by user1029296


3 Answers

Since your text has to be processed, and not just pattern matched, the right approach is to use one of the many NLP tools available.

What you want is Named Entity Recognition (NER), which is usually done with machine learning models. NER attempts to recognize a fixed set of entity types in text, such as locations, dates, organizations and person names.

While not 100% accurate, this is much more precise than simple pattern matching (especially for English), since it relies on information beyond surface patterns, such as part-of-speech (POS) tags and dependency parses.

Take a look at the results I obtained for the phrases you provided, using the AllenNLP online tool (with the fine-grained NER model). (The tool's entity highlighting is not preserved here.)

  • "Mr Bond, 67, is an engineer in the UK"
  • "Amanda B. Bynes, 34, is an actress"
  • "Peter Parker (45) will be our next administrator"
  • "Mr. Dylan is 46 years old."
  • "Steve Jones, Age: 32,"

Notice that the last one came out wrong. As I said, it is not 100% accurate, but it is easy to use.

The big advantage of this approach: you don't have to make a special pattern for every one of the millions of possibilities available.

The best thing: you can integrate it into your Python code:

pip install allennlp

And:

from allennlp.predictors import Predictor
# load the pretrained fine-grained NER model (downloaded on first use)
al = Predictor.from_path("https://s3-us-west-2.amazonaws.com/allennlp/models/fine-grained-ner-model-elmo-2018.12.21.tar.gz")
al.predict("Your sentence with date here")

Then, look at the resulting dict for "Date" Entities.
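As a rough sketch of that last step: I am assuming here that the returned dict has parallel "words" and "tags" lists and that the tags use BIOUL prefixes (B-, I-, L-, U-, O), so check this against what your installed version actually returns. The helper name extract_entities and the example dict are made up for illustration; the dict is hand-written, not real model output.

```python
def extract_entities(prediction, wanted_label="DATE"):
    """Group BIOUL-tagged tokens into entity strings for one label.

    Assumes prediction["words"] and prediction["tags"] are parallel lists.
    """
    entities, current = [], []
    for word, tag in zip(prediction["words"], prediction["tags"]):
        prefix, _, label = tag.partition("-")  # "O" gives an empty label
        if label != wanted_label:
            current = []
            continue
        if prefix == "U":                      # single-token entity
            entities.append(word)
            current = []
        elif prefix == "B":                    # first token of a longer entity
            current = [word]
        elif prefix == "I" and current:        # middle token
            current.append(word)
        elif prefix == "L" and current:        # last token closes the entity
            current.append(word)
            entities.append(" ".join(current))
            current = []
    return entities


# Hand-written example in the assumed output shape (not real model output)
example = {
    "words": ["Mr.", "Dylan", "is", "46", "years", "old", "."],
    "tags": ["O", "U-PERSON", "O", "B-DATE", "I-DATE", "L-DATE", "O"],
}
print(extract_entities(example))            # ['46 years old']
print(extract_entities(example, "PERSON"))  # ['Dylan']
```

You would still need to pull the number out of each returned span and check that it is a plausible age.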

The same goes for spaCy:

# notebook syntax; from a shell, drop the leading "!"
!python3 -m spacy download en_core_web_lg
import spacy
sp_lg = spacy.load('en_core_web_lg')
# set of (entity text, entity label) pairs found in the sentence
{(ent.text.strip(), ent.label_) for ent in sp_lg("Your sentence with date here").ents}

(However, I have had some poor predictions from it, even though it is generally considered better.)

For more info, read this interesting article on Medium: https://medium.com/@b.terryjack/nlp-pretrained-named-entity-recognition-7caa5cd28d7b

answered Oct 12 '22 22:10 by Tiago Duque


import re

x = ["Mr Bond, 67, is an engineer in the UK",
     "Amanda B. Bynes, 34, is an actress",
     "Peter Parker (45) will be our next administrator",
     "Mr. Dylan is 46 years old.",
     "Steve Jones, Age:32,"]

# take the first run of 1-3 digits in each sentence
# (raises IndexError if a sentence contains no digits at all)
[re.findall(r'\d{1,3}', i)[0] for i in x] # ['67', '34', '45', '46', '32']
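Grabbing the first run of digits will also pick up years and percentages in longer text (for "in 2010" it returns '201'). One way to require some context around the number, and to search for all the phrasings in a single pass as the question asks, is one compiled pattern with named alternatives. This is only a sketch: the four alternatives, the group names and the find_ages helper are my own choices, and the 18 to 100 range is borrowed from the question's code.

```python
import re

# One pattern, several alternatives; each captures the age in a named group.
AGE_PATTERN = re.compile(
    r"""
      \bage\s*:?\s*(?P<labelled>\d{2,3})\b           # Age: 32 / age 43 / Age 68
    | ,\s(?P<appositive>\d{2,3}),                    # Mr Bond, 67, is
    | \s\((?P<parenthesised>\d{2,3})\)               # Peter Parker (45)
    | \bis\s(?P<years_old>\d{2,3})\syears?\sold\b    # is 46 years old
    """,
    re.IGNORECASE | re.VERBOSE,
)


def find_ages(text, low=18, high=100):
    """Return (age, pattern_name) pairs for every plausible age in text."""
    results = []
    for match in AGE_PATTERN.finditer(text):
        # lastgroup is the name of the alternative that matched
        age = int(match.group(match.lastgroup))
        if low < age < high:
            results.append((age, match.lastgroup))
    return results


text = ("Mr. Lovallo, 47, was appointed Treasurer in 2011. "
        "Peter Parker (45) will be our next administrator.")
print(find_ages(text))  # [(47, 'appositive'), (45, 'parenthesised')]
```

Because the pattern is compiled once and scanned with finditer, a long document is traversed a single time instead of once per search string. Keeping the pattern name next to each hit also makes it easier to see which alternative produces false positives.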
answered Oct 12 '22 23:10 by ComplicatedPhenomenon


This will work for all the cases you provided: https://repl.it/repls/NotableAncientBackground

import re

# "sentences" rather than "input", which would shadow the built-in
sentences = ["Mr Bond, 67, is an engineer in the UK",
    "Amanda B. Bynes, 34, is an actress",
    "Peter Parker (45) will be our next administrator",
    "Mr. Dylan is 46 years old.",
    "Steve Jones, Age:32,",
    "Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation",
    "George F. Rubin(14)(15) Age 68 Trustee since: 1997.",
    "INDRA K. NOOYI, 56, has been PepsiCos Chief Executive Officer (CEO) since 2006",
    "Mr. Lovallo, 47, was appointed Treasurer in 2011.",
    "Mr. Charles Baker, 79, is a business advisor to biotechnology companies.",
    "Mr. Botein, age 43, has been a member of our Board since our formation."]
for i in sentences:
  # "Age" followed by a colon or whitespace, then the number
  age = re.findall(r'Age[\:\s](\d{1,3})', i)
  # a number between spaces, optionally followed by a comma
  age.extend(re.findall(r' (\d{1,3}),? ', i))
  if len(age) == 0:
    # fall back to a number in parentheses
    age = re.findall(r'\((\d{1,3})\)', i)
  print(i+ " --- AGE: "+ str(set(age)))

Returns

Mr Bond, 67, is an engineer in the UK --- AGE: {'67'}
Amanda B. Bynes, 34, is an actress --- AGE: {'34'}
Peter Parker (45) will be our next administrator --- AGE: {'45'}
Mr. Dylan is 46 years old. --- AGE: {'46'}
Steve Jones, Age:32, --- AGE: {'32'}
Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation --- AGE: set()
George F. Rubin(14)(15) Age 68 Trustee since: 1997. --- AGE: {'68'}
INDRA K. NOOYI, 56, has been PepsiCos Chief Executive Officer (CEO) since 2006 --- AGE: {'56'}
Mr. Lovallo, 47, was appointed Treasurer in 2011. --- AGE: {'47'}
Mr. Charles Baker, 79, is a business advisor to biotechnology companies. --- AGE: {'79'}
Mr. Botein, age 43, has been a member of our Board since our formation. --- AGE: {'43'}
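For the question about detecting patterns you have not seen yet, one option is to look at what surrounds every standalone two-digit number and count how often each context shows up. The sketch below is only an illustration: the context_templates name, the two-character window and the letter masking are arbitrary choices of mine, not something taken from your data.

```python
import re
from collections import Counter

# Two digits that are not part of a longer number (skips 2010, 1997, ...)
NUMBER = re.compile(r"(?<!\d)\d{2}(?!\d)")


def context_templates(sentences, width=2):
    """Count the character contexts around standalone two-digit numbers."""
    counts = Counter()
    for sentence in sentences:
        for match in NUMBER.finditer(sentence):
            left = sentence[max(0, match.start() - width):match.start()]
            right = sentence[match.end():match.end() + width]
            # Mask letters so different words produce the same template
            left = re.sub(r"[A-Za-z]", "a", left)
            right = re.sub(r"[A-Za-z]", "a", right)
            counts[left + "<NUM>" + right] += 1
    return counts


sample = [
    "Mr Bond, 67, is an engineer in the UK",
    "Mr. Lovallo, 47, was appointed Treasurer in 2011.",
    "Peter Parker (45) will be our next administrator",
]
for template, count in context_templates(sample).most_common():
    print(repr(template), count)
# ', <NUM>, ' 2
# ' (<NUM>) ' 1
```

Run over the whole dataset, the most frequent templates that your current patterns do not cover are good candidates for new rules, and rare ones can be checked by hand.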
answered Oct 13 '22 00:10 by Sheshank S.