I would like to parse a document using spaCy and apply a token filter so that the final spaCy document does not include the filtered tokens. I know that I can take the sequence of tokens filtered, but I am insterested in having the actual Doc
structure.
text = u"This document is only an example. " \
"I would like to create a custom pipeline that will remove specific tokesn from the final document."
doc = nlp(text)
def keep_token(tok):
# This is only an example rule
return tok.pos_ not not in {'PUNCT', 'NUM', 'SYM'}
final_tokens = list(filter(keep_token, doc))
# How to get a spacy.Doc from final_tokens?
I tried to reconstruct a new spaCy Doc
from the tokens lists but the API is not clear how to do it.
I am pretty sure that you have found your solution till now but because it is not posted here I thought it may be useful to add it.
You can remove tokens by converting doc to numpy array, removing from numpy array and then converting back to doc.
Code:
import spacy
from spacy.attrs import LOWER, POS, ENT_TYPE, IS_ALPHA
from spacy.tokens import Doc
import numpy
def remove_tokens_on_match(doc):
indexes = []
for index, token in enumerate(doc):
if (token.pos_ in ('PUNCT', 'NUM', 'SYM')):
indexes.append(index)
np_array = doc.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])
np_array = numpy.delete(np_array, indexes, axis = 0)
doc2 = Doc(doc.vocab, words=[t.text for i, t in enumerate(doc) if i not in indexes])
doc2.from_array([LOWER, POS, ENT_TYPE, IS_ALPHA], np_array)
return doc2
# load english model
nlp = spacy.load('en')
doc = nlp(u'This document is only an example. \
I would like to create a custom pipeline that will remove specific tokens from \
the final document.')
print(remove_tokens_on_match(doc))
You can look to a similar question that I answered here.
Depending on what you want to do there are several approaches.
1. Get the original Document
Tokens in SpaCy have references to their document, so you can do this:
original_doc = final_tokens[0].doc
This way you can still get PoS, parse data etc. from the original sentence.
2. Construct a new document without the removed tokens
You can append the strings of all the tokens with whitespace and create a new document. See the token docs for information on text_with_ws
.
doc = nlp(''.join(map(lambda x: x.text_with_ws, final_tokens)))
This is probably not going to give you what you want though - PoS tags will not necessarily be the same, and the resulting sentence may not make sense.
If neither of those was what you had in mind, let me know and maybe I can help.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With