To split my sentences into tokens I'm doing the following, which is slow:
import spacy

nlp = spacy.load("en_core_web_lg")
text = "This is a test. This is another test"
sentence_tokens = []
doc = nlp(text)
for sent in doc.sents:
    words = nlp(sent.text)
    all = []
    for w in words:
        all.append(w)
    sentence_tokens.append(all)
I'd like to do this the way NLTK handles it: split the text into sentences using sent_tokenize() and then run word_tokenize() on each sentence.
In spaCy, tokenizing a text into segments of words and punctuation is done in several steps, processing the text from left to right. First, the tokenizer splits the text on whitespace, similar to the split() function. Then, for each substring, it checks whether the substring matches a tokenizer exception rule.
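The exception rules are what split contractions like "don't" even though they contain no whitespace. A minimal sketch; a blank English pipeline is enough, since the tokenizer and its exception rules don't require a trained model:

```python
import spacy

# spacy.blank("en") gives an empty pipeline with the English tokenizer
nlp = spacy.blank("en")
doc = nlp("Don't split this!")
# The exception rule splits "Don't" into "Do" and "n't"
print([token.text for token in doc])
```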
While NLTK exposes many alternative algorithms for each task, spaCy is opinionated and ships a single, well-tuned implementation. It provides fast and accurate syntactic analysis, along with access to larger word vectors that are easier to customize.
When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing pipeline. The pipeline used by the trained pipelines typically includes a tagger, a lemmatizer, a parser and an entity recognizer.
The main problem with your approach is that you're processing everything twice. A sentence in doc.sents is a Span object, i.e. a sequence of Tokens. So there's no need to call nlp on the sentence text again – spaCy already does all of this for you under the hood, and the Doc you get back already includes all the information you need.
So if you need a list of strings, one for each token, you can do:
sentence_tokens = []
for sent in doc.sents:
    sentence_tokens.append([token.text for token in sent])
Or even shorter:
sentence_tokens = [[token.text for token in sent] for sent in doc.sents]
If you're processing a lot of texts, you probably also want to use nlp.pipe to make it more efficient. This will process the texts in batches and yield Doc objects. You can read more about it here.
texts = ["Some text", "Lots and lots of texts"]
for doc in nlp.pipe(texts):
    sentence_tokens = [[token.text for token in sent] for sent in doc.sents]
    # do something with the tokens
To just do the rule-based tokenization, which is very fast, run:
nlp = spacy.load('en_core_web_sm') # no need for large model
doc = nlp.make_doc(text)
print([token.text for token in doc])
There won't be sentence boundaries, though. For that, you currently still need the parser. If you want both tokens and sentence boundaries:
nlp = spacy.load("en_core_web_sm", disable=["tagger", "ner"]) # just the parser
doc = nlp(text)
print([token.text for token in doc])
print([sent.text for sent in doc.sents])
If you have a lot of texts, run nlp.tokenizer.pipe(texts) (similar to make_doc()) or nlp.pipe(texts).
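A quick sketch of the tokenizer-only batch path. A blank pipeline is used here to keep it self-contained; a loaded model's tokenizer works the same way:

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only, no trained components
texts = ["First text here.", "Second text here."]
# Tokenizer.pipe yields a Doc per input text, tokenized but unannotated
tokenized = [[token.text for token in doc] for doc in nlp.tokenizer.pipe(texts)]
print(tokenized)
```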
(Once you've run doc = nlp(text), you don't need to run it again on the sentences within the loop. All the annotation should already be there, and you'd just be duplicating it – which would be particularly slow.)