Removing named entities from a document using spacy

Tags: python, text, nlp, spacy

I am trying to remove the words that spaCy considers named entities from a document, i.e. remove "Sweden" and "Nokia" from the example string below. I could not work around the fact that entities are stored as spans: comparing them with individual tokens of a spaCy doc raises an error.

In a later step, this process is supposed to be a function applied to several text documents stored in a pandas data frame.

I would appreciate any kind of help and advice on how to maybe better post questions as this is my first one here.


nlp = spacy.load('en')

text_data = u'This is a text document that speaks about entities like Sweden and Nokia'

document = nlp(text_data)

text_no_namedentities = []

for word in document:
    if word not in document.ents:
        text_no_namedentities.append(word)

return " ".join(text_no_namedentities)

It creates the following error:

TypeError: Argument 'other' has incorrect type (expected spacy.tokens.token.Token, got spacy.tokens.span.Span)

asked Dec 12 '19 22:12

john_28

2 Answers

Comparing each token's text against the entity texts, as in the code below, does not handle entities that cover multiple tokens.

import spacy
nlp = spacy.load('en_core_web_sm')
text_data = 'New York is in USA'
document = nlp(text_data)

text_no_namedentities = []
ents = [e.text for e in document.ents]
for item in document:
    if item.text in ents:
        pass
    else:
        text_no_namedentities.append(item.text)
print(" ".join(text_no_namedentities))

Output

New York is in

Here USA is correctly removed, but New York is not: the entity text "New York" never matches the single-token texts "New" or "York".
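To make the mismatch concrete, here is a minimal sketch that prints both lists side by side. It assumes spaCy v3 and, so that the result does not depend on a downloaded model, uses a blank pipeline with a rule-based entity_ruler in place of en_core_web_sm; the two patterns are made up for this example.

```python
import spacy
from spacy.tokens import Span

# Assumption: a blank pipeline plus entity_ruler stands in for a trained model,
# so the entities found are fully determined by the patterns below.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "GPE", "pattern": "New York"},
    {"label": "GPE", "pattern": "USA"},
])

doc = nlp("New York is in USA")

entity_texts = [ent.text for ent in doc.ents]
token_texts = [token.text for token in doc]
print(entity_texts)  # ['New York', 'USA']
print(token_texts)   # ['New', 'York', 'is', 'in', 'USA']

# doc.ents holds Span objects, not Tokens, which is also why the
# `word not in document.ents` test in the question raises a TypeError.
all_spans = all(isinstance(ent, Span) for ent in doc.ents)
print(all_spans)  # True

# Text comparison only removes entities that are exactly one token long.
leftover = [text for text in token_texts if text not in entity_texts]
print(leftover)  # ['New', 'York', 'is', 'in']
```
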

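Solution: check each token's ent_type_ attribute, which is an empty string for tokens that are not part of an entity.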

import spacy
nlp = spacy.load('en_core_web_sm')
text_data = 'New York is in USA'
document = nlp(text_data)
# Keep only the tokens that do not belong to any entity
print(" ".join([token.text for token in document if not token.ent_type_]))

Output

is in
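Since the question wants this as a reusable function, the same token-level check could be wrapped up as below. This is only a sketch: the function name remove_named_entities is my own, spaCy v3 is assumed, and a blank pipeline with entity_ruler patterns replaces en_core_web_sm to keep the output deterministic. It joins on token.text_with_ws rather than " " so that the original spacing and punctuation are preserved.

```python
import spacy

# Assumption: rule-based entities instead of a trained model, for reproducibility.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "GPE", "pattern": "New York"},
    {"label": "GPE", "pattern": "USA"},
])


def remove_named_entities(doc):
    """Return the text of `doc` without the tokens that belong to an entity."""
    # token.ent_type_ is '' for tokens outside any entity.
    # text_with_ws carries each token's trailing whitespace from the source text.
    kept = [token.text_with_ws for token in doc if not token.ent_type_]
    return "".join(kept).strip()


result = remove_named_entities(nlp("New York is in USA"))
print(result)  # is in
```

With a trained model you would only swap the pipeline setup for spacy.load('en_core_web_sm'); the function itself stays the same, although which tokens are tagged then depends on the model's predictions.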

answered Sep 18 '22 14:09

kochar96

This gives the result you're asking for in your example, where every entity is a single token. Reviewing spaCy's Named Entity Recognition documentation should help you going forward.

import spacy

nlp = spacy.load('en_core_web_sm')

text_data = 'This is a text document that speaks about entities like Sweden and Nokia'

document = nlp(text_data)

text_no_namedentities = []

# Collect the entity texts, then keep every token whose text is not among them
ents = [e.text for e in document.ents]
for item in document:
    if item.text not in ents:
        text_no_namedentities.append(item.text)
print(" ".join(text_no_namedentities))

Output:

This is a text document that speaks about entities like and
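The question also mentions running this over texts stored in a pandas data frame. A rough sketch of how that might look follows, with several assumptions: the column names text and text_no_entities and the helper strip_entities are hypothetical, the second row is made-up sample data, and a blank spaCy v3 pipeline with entity_ruler patterns is used instead of en_core_web_sm so the result is reproducible. It filters on token.ent_type_ so that multi-token entities would be removed as well.

```python
import pandas as pd
import spacy

# Assumption: rule-based entities stand in for a trained model.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "GPE", "pattern": "Sweden"},
    {"label": "ORG", "pattern": "Nokia"},
])


def strip_entities(doc):
    """Join the texts of all tokens that are not part of an entity."""
    return " ".join(token.text for token in doc if not token.ent_type_)


# Hypothetical data frame with one text per row.
df = pd.DataFrame({
    "text": [
        "This is a text document that speaks about entities like Sweden and Nokia",
        "Nokia was founded in 1865",
    ]
})

# nlp.pipe processes the texts as a stream, which is usually faster than
# calling nlp() once per row via DataFrame.apply.
df["text_no_entities"] = [
    strip_entities(doc) for doc in nlp.pipe(df["text"].tolist())
]
print(df["text_no_entities"].tolist())
```
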

answered Sep 21 '22 14:09

APhillips
