
Is it possible to use spaCy with already-tokenized input?

Tags: python, nlp, spacy

I have a sentence that has already been tokenized into words, and I want the part-of-speech tag for each word. The examples in the spaCy documentation all start from the raw sentence string. I don't want to do that, because spaCy might then tokenize the sentence differently from my existing tokens. Is it possible to run spaCy on a list of words rather than on a string?

Here is an example of what I mean:

# I know that the following works:
import spacy
nlp = spacy.load('en_core_web_sm')
raw_text = 'Hello, world.'
doc = nlp(raw_text)
for token in doc:
    print(token.pos_)

But I want to do something similar to the following:

import spacy
nlp = spacy.load('en_core_web_sm')
tokenized_text = ['Hello', ',', 'world', '.']
doc = nlp(tokenized_text)  # does not work: nlp() does not accept a list of words
for token in doc:
    print(token.pos_)

I know this doesn't work, but is there a way to do something similar?

zwlayer asked Dec 03 '18 13:12




1 Answer

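Build a Doc object directly from your list of words and pass that to the pipeline. Recent spaCy 3 releases (v3.2 and later, as far as I know) accept a Doc as input to nlp() and skip the tokenizer when given one, so your tokenization is kept as-is: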

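import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")

# Each sentence is already split into tokens.
sents = [['Hello', ',', 'world', '.']]
for sent in sents:
    # Build the Doc from the word list so spaCy's tokenizer never runs.
    doc = Doc(nlp.vocab, words=sent)
    # Passing a Doc to nlp() applies the pipeline components to it.
    for token in nlp(doc):
        print(token.text, token.pos_)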
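
If your spaCy version does not accept a Doc as input to nlp(), a rough equivalent is to run the pipeline components on the Doc yourself. The sketch below is only an illustration and not part of the original answer: run_on_tokens is a hypothetical helper name, and the optional spaces list (one boolean per word, saying whether that word is followed by whitespace) is an extra I am assuming you might want so that doc.text reproduces the original string.

```python
import spacy
from spacy.tokens import Doc


def run_on_tokens(nlp, words, spaces=None):
    """Build a Doc from pre-split words and run each pipeline component on it.

    run_on_tokens is a hypothetical helper name used for illustration only.
    """
    # spaces[i] says whether words[i] is followed by a space. If spaces is
    # omitted, spaCy assumes a space after every token.
    doc = Doc(nlp.vocab, words=words, spaces=spaces)
    # nlp.pipeline is a list of (name, component) pairs. The tokenizer is not
    # one of them, so the tokenization given in `words` is left untouched.
    for _name, component in nlp.pipeline:
        doc = component(doc)
    return doc
```

For example, with nlp = spacy.load("en_core_web_sm"), calling run_on_tokens(nlp, ['Hello', ',', 'world', '.'], [False, True, False, False]) should give a Doc with exactly those four tokens, whose text is 'Hello, world.' and whose tokens carry pos_ values from the tagger.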
Victor Yan answered Oct 23 '22 13:10
