I have a text file which contains lines as shown below:
Electronically signed : Wes Scott, M.D.; Jun 26 2010 11:10AM CST
The patient was referred by Dr. Jacob Austin.
Electronically signed by Robert Clowson, M.D.; Janury 15 2015 11:13AM CST
Electronically signed by Dr. John Douglas, M.D.; Jun 16 2017 11:13AM CST
The patient was referred by
Dr. Jayden Green Olivia.
I want to extract all names using spaCy. I am using spaCy's part-of-speech tagging and entity recognition but have not been able to get it to work. Could someone explain how this can be done? Any help would be appreciated.
I am using code along the following lines:
import spacy

nlp = spacy.load('en')
document_string = ("Electronically signed by stupid: Dr. John Douglas, M.D.; "
                   "Jun 13 2018 11:13AM CST")
doc = nlp(document_string)
for ent in doc.ents:
    print(ent, ent.label_)
The problem with all models is that they don't have 100% accuracy, and even using a larger model doesn't help with recognizing dates. Here are the accuracy values (F-score, precision, recall) for the NER models: they are all around 86%.
document_string = """
Electronically signed : Wes Scott, M.D.; Jun 26 2010 11:10AM CST
The patient was referred by Dr. Jacob Austin.
Electronically signed by Robert Clowson, M.D.; Janury 15 2015 11:13AM CST
Electronically signed by Dr. John Douglas, M.D.; Jun 16 2017 11:13AM CST
The patient was referred by
Dr. Jayden Green Olivia.
"""
With the small model, two date items are labelled as 'PERSON':
import spacy
nlp = spacy.load('en')
sents = nlp(document_string)
[ee for ee in sents.ents if ee.label_ == 'PERSON']
# Out:
# [Wes Scott,
# Jun 26,
# Jacob Austin,
# Robert Clowson,
# John Douglas,
# Jun 16 2017,
# Jayden Green Olivia]
With a larger model, en_core_web_md, the results are even worse in terms of precision, as there are three misclassified entities:
nlp = spacy.load('en_core_web_md')
sents = nlp(document_string)
[ee for ee in sents.ents if ee.label_ == 'PERSON']
# Out:
#[Wes Scott,
# Jun 26,
# Jacob Austin,
# Robert Clowson,
# Janury,
# John Douglas,
# Jun 16 2017,
# Jayden Green Olivia]
I also tried other models (xx_ent_wiki_sm, en_core_web_md) and they don't bring any improvement either.
In this small example, not only does the document seem to have a clear structure, but the misclassified entities are all dates. So why not combine the initial model with a rule-based component?
The good news is that in spaCy:
it's possible to combine statistical and rule-based components in a variety of ways. Rule-based components can be used to improve the accuracy of statistical models
(from https://spacy.io/usage/rule-based-matching#models-rules)
So, following that example and using the dateparser library (a parser for human-readable dates), I've put together a rule-based component that works very well on this example:
from spacy.tokens import Span
import dateparser

def expand_person_entities(doc):
    new_ents = []
    for ent in doc.ents:
        # Only check for a title if it's a person and not the first token
        if ent.label_ == "PERSON":
            if ent.start != 0:
                prev_token = doc[ent.start - 1]
                if prev_token.text in ("Dr", "Dr.", "Mr", "Mr.", "Ms", "Ms."):
                    # if the person is preceded by a title, include the title in the entity
                    new_ent = Span(doc, ent.start - 1, ent.end, label=ent.label)
                    new_ents.append(new_ent)
                else:
                    # if the entity can be parsed as a date, it's not a person
                    if dateparser.parse(ent.text) is None:
                        new_ents.append(ent)
        else:
            new_ents.append(ent)
    doc.ents = new_ents
    return doc

# Add the component after the named entity recognizer
# nlp.remove_pipe('expand_person_entities')
nlp.add_pipe(expand_person_entities, after='ner')

doc = nlp(document_string)
[(ent.text, ent.label_) for ent in doc.ents if ent.label_ == 'PERSON']
# Out:
# [('Wes Scott', 'PERSON'),
# ('Dr. Jacob Austin', 'PERSON'),
# ('Robert Clowson', 'PERSON'),
# ('Dr. John Douglas', 'PERSON'),
# ('Dr. Jayden Green Olivia', 'PERSON')]
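To see why the dateparser check filters out the misclassified spans, here is a minimal sketch of how dateparser.parse behaves on the two kinds of text (the exact datetime returned will depend on your locale and settings):
import dateparser

# A span that looks like a date parses to a datetime object...
print(dateparser.parse("Jun 26 2010 11:10AM"))   # e.g. datetime.datetime(2010, 6, 26, 11, 10)
# ...while a person name does not parse, so the component keeps it as PERSON
print(dateparser.parse("Jacob Austin"))          # None
Note that nlp.add_pipe(expand_person_entities, after='ner') is the spaCy v2 way of adding a custom function to the pipeline; in spaCy v3 the function would first need to be registered with the @Language.component decorator and then added by its string name.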