I have a sentence that has already been tokenized into words, and I want to get the part-of-speech tag for each word. When I checked the spaCy documentation, I saw that it starts from the raw sentence string. I want to avoid that, because spaCy might end up with a different tokenization than mine. So: is it possible to use spaCy with a list of words (rather than a string)?
Here is an example of what I mean:
# I know that it does the following successfully:
import spacy
nlp = spacy.load('en_core_web_sm')
raw_text = 'Hello, world.'
doc = nlp(raw_text)
for token in doc:
    print(token.pos_)
But I want to do something similar to the following:
import spacy
nlp = spacy.load('en_core_web_sm')
tokenized_text = ['Hello',',','world','.']
doc = nlp(tokenized_text)
for token in doc:
    print(token.pos_)
I know this doesn't work as written, but is it possible to do something like it?
In spaCy, tokenizing a text into segments of words and punctuation is done in several steps, processing the text from left to right. First, the tokenizer splits the text on whitespace, similar to Python's str.split(). Then, for each resulting substring, it checks whether the substring matches a tokenizer exception rule and whether punctuation can be split off as a prefix, suffix, or infix.
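As a quick illustration (assuming the en_core_web_sm model is installed), you can see the default tokenizer split punctuation off into its own tokens:

import spacy
nlp = spacy.load('en_core_web_sm')
# A plain whitespace split would give ['Hello,', 'world.'];
# spaCy's punctuation rules additionally split off ',' and '.'
print([token.text for token in nlp('Hello, world.')])
# ['Hello', ',', 'world', '.']

This is exactly why your pre-tokenized words and spaCy's own tokenization can disagree.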
Use the Doc object:
import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")
sents = [['Hello', ',', 'world', '.']]

for sent in sents:
    # Build a Doc directly from the pre-tokenized words,
    # bypassing spaCy's tokenizer entirely
    doc = Doc(nlp.vocab, words=sent)
    # Passing a Doc (rather than a string) to nlp runs the rest of the
    # pipeline (tagger, parser, ...) on it without re-tokenizing
    for token in nlp(doc):
        print(token.text, token.pos_)
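Note that passing a Doc directly to nlp requires a recent spaCy version (v3+). If you also need doc.text to reproduce the original string, Doc accepts an optional spaces argument, a list of booleans marking whether each word is followed by a space. A minimal sketch of the same idea (the spaces values here are my assumption about the original spacing):

import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")
words = ['Hello', ',', 'world', '.']
# One boolean per word: is this token followed by a space?
spaces = [False, True, False, False]
doc = nlp(Doc(nlp.vocab, words=words, spaces=spaces))
print(doc.text)  # Hello, world.
print([(token.text, token.pos_) for token in doc])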