I have tried to remove words from a document that are considered to be named entities by spacy, so basically removing "Sweden" and "Nokia" from the string example. I could not find a way to work around the problem that entities are stored as a span. So when comparing them with single tokens from a spacy doc, it prompts an error.
In a later step, this process is supposed to be a function applied to several text documents stored in a pandas data frame.
I would appreciate any kind of help and advice on how to maybe better post questions as this is my first one here.
nlp = spacy.load('en')
text_data = u'This is a text document that speaks about entities like Sweden and Nokia'
document = nlp(text_data)
text_no_namedentities = []
for word in document:
if word not in document.ents:
text_no_namedentities.append(word)
return " ".join(text_no_namedentities)
It creates the following error:
TypeError: Argument 'other' has incorrect type (expected spacy.tokens.token.Token, got spacy.tokens.span.Span)
Named Entities can be a place, person, organization, time, object, or geographic entity.
NLP helps you extract insights from unstructured text and has several use cases, such as: Automatic summarization. Named entity recognition. Question answering systems.
A Doc is a sequence of Token objects. Access sentences and named entities, export annotations to numpy arrays, losslessly serialize to compressed binary strings. The Doc object holds an array of TokenC structs. The Python-level Token and Span objects are views of this array, i.e. they don't own the data themselves.
While NLTK provides access to many algorithms to get something done, spaCy provides the best way to do it. It provides the fastest and most accurate syntactic analysis of any NLP library released to date. It also offers access to larger word vectors that are easier to customize.
This will not handle entities covering multiple tokens.
import spacy
nlp = spacy.load('en_core_web_sm')
text_data = 'New York is in USA'
document = nlp(text_data)
text_no_namedentities = []
ents = [e.text for e in document.ents]
for item in document:
if item.text in ents:
pass
else:
text_no_namedentities.append(item.text)
print(" ".join(text_no_namedentities))
Output
'New York is in'
Here USA
is correctly removed but couldn't eliminate New York
Solution
import spacy
nlp = spacy.load('en_core_web_sm')
text_data = 'New York is in USA'
document = nlp(text_data)
print(" ".join([ent.text for ent in document if not ent.ent_type_]))
Output
'is in'
This will get you the result you're asking for. Reviewing the Named Entity Recognition should help you going forward.
import spacy
nlp = spacy.load('en_core_web_sm')
text_data = 'This is a text document that speaks about entities like Sweden and Nokia'
document = nlp(text_data)
text_no_namedentities = []
ents = [e.text for e in document.ents]
for item in document:
if item.text in ents:
pass
else:
text_no_namedentities.append(item.text)
print(" ".join(text_no_namedentities))
Output:
This is a text document that speaks about entities like and
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With