 

Is there a fast way to get the tokens for each sentence in spaCy?

Tags:

spacy

To split my sentences into tokens I'm doing the following, which is slow:

import spacy

nlp = spacy.load("en_core_web_lg")

text = "This is a test. This is another test"

sentence_tokens = []
doc = nlp(text)
for sent in doc.sents:
    words = nlp(sent.text)
    all = []
    for w in words:
        all.append(w)
    sentence_tokens.append(all)

I kind of want to do this the way NLTK handles it, where you split the text into sentences using sent_tokenize() and then run word_tokenize() on each sentence, for example:
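For reference, this is roughly the NLTK pattern I have in mind (assuming the punkt tokenizer data has already been downloaded):

from nltk.tokenize import sent_tokenize, word_tokenize

text = "This is a test. This is another test"

# split into sentences, then tokenize each sentence
sentence_tokens = [word_tokenize(sent) for sent in sent_tokenize(text)]
print(sentence_tokens)
# roughly: [['This', 'is', 'a', 'test', '.'], ['This', 'is', 'another', 'test']]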

asked Aug 27 '19 by erotavlas

People also ask

How do you tokenize sentences in spaCy?

In spaCy, the process of tokenizing a text into segments of words and punctuation is done in several steps. It processes the text from left to right. First, the tokenizer splits the text on whitespace, similar to the split() function. Then the tokenizer checks whether the substring matches any tokenizer exception rules and splits off punctuation as needed.
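A minimal sketch of that behaviour, using a blank English pipeline (tokenizer only, no trained components):

import spacy

nlp = spacy.blank("en")  # just the tokenizer, no tagger/parser/NER
doc = nlp("Let's go to N.Y.!")
print([token.text for token in doc])
# roughly: ['Let', "'s", 'go', 'to', 'N.Y.', '!']

The exception rules are what keep "N.Y." together while splitting "Let's" into "Let" and "'s".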

Is spaCy better than NLTK?

While NLTK provides access to many algorithms to get something done, spaCy provides the best way to do it. It provides the fastest and most accurate syntactic analysis of any NLP library released to date. It also offers access to larger word vectors that are easier to customize.

What does nlp() do in spaCy?

When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing pipeline. The pipeline used by the trained pipelines typically includes a tagger, a lemmatizer, a parser and an entity recognizer.
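As a small illustration, you can inspect which components a loaded pipeline runs (assuming en_core_web_sm is installed; the exact component names depend on the model version):

import spacy

nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)
# e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

doc = nlp("This is a test.")
print([(token.text, token.pos_, token.lemma_) for token in doc])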


2 Answers

The main problem with your approach is that you're processing everything twice. A sentence in doc.sents is a Span object, i.e. a sequence of Tokens. So there's no need to call nlp on the sentence text again – spaCy already does all of this for you under the hood, and the Doc you get back already includes all the information you need.

So if you need a list of strings, one for each token, you can do:

sentence_tokens = []
for sent in doc.sents:
    sentence_tokens.append([token.text for token in sent])

Or even shorter:

sentence_tokens = [[token.text for token in sent] for sent in doc.sents]

If you're processing a lot of texts, you probably also want to use nlp.pipe to make it more efficient. This will process the texts in batches and yield Doc objects. You can read more about it in the spaCy documentation on processing pipelines.

texts = ["Some text", "Lots and lots of texts"]
for doc in nlp.pipe(texts):
   sentence_tokens = [[token.text for token in sent] for sent in doc.sents]
   # do something with the tokens 
answered Nov 15 '22 by Ines Montani

To just do the rule-based tokenization, which is very fast, run:

nlp = spacy.load('en_core_web_sm') # no need for large model
doc = nlp.make_doc(text)
print([token.text for token in doc])

There won't be sentence boundaries, though. For that you still currently need the parser. If you want tokens and sentence boundaries:

nlp = spacy.load("en_core_web_sm", disable=["tagger", "ner"]) # just the parser
doc = nlp(text)
print([token.text for token in doc])
print([sent.text for sent in doc.sents])

If you have a lot of texts, run nlp.tokenizer.pipe(texts) (similar to make_doc()) or nlp.pipe(texts).
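As a rough sketch of the difference (texts is just a placeholder list; nlp is the pipeline loaded above):

texts = ["This is a test. This is another test", "Lots and lots of texts"]

# tokenization only, no sentence boundaries
for doc in nlp.tokenizer.pipe(texts):
    print([token.text for token in doc])

# full pipeline in batches, so doc.sents is available
for doc in nlp.pipe(texts):
    print([[token.text for token in sent] for sent in doc.sents])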

(Once you've run doc = nlp(text), you don't need to run it again on the sentences within the loop. All the annotation is already there, so you'd just be duplicating it – and that would be particularly slow.)

answered Nov 15 '22 by aab