Separate texts into sentences: NLTK vs spaCy

I want to separate texts into sentences.

Looking on Stack Overflow I found:

WITH NLTK

from nltk.tokenize import sent_tokenize  # needs the punkt data: nltk.download('punkt')

text = """Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome. The sky is pinkish-blue. You shouldn't eat cardboard"""
tokenized_text = sent_tokenize(text)
print(tokenized_text)

WITH SPACY

from spacy.lang.en import English

raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
nlp.add_pipe(nlp.create_pipe('sentencizer'))  # spaCy 2.x API; in spaCy 3.x this is nlp.add_pipe('sentencizer')
doc = nlp(raw_text)
sentences = [sent.text.strip() for sent in doc.sents]  # sent.text; sent.string was removed in spaCy 3.x

The question is: what happens in the background that makes spaCy do it differently, with a so-called create_pipe? Sentences are important for training your own word embeddings for NLP. There should be a reason why spaCy does not include a sentence tokenizer directly out of the box.

Thanks.

NOTE: Be aware that a simple .split('.') does not work: the text contains several decimal numbers and other kinds of tokens containing '.'.
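
For illustration, a quick snippet (with a hypothetical sample string) showing how the naive split mangles both decimals and abbreviations:

text = "The price rose 3.5 percent. Mr. Smith agreed."
print(text.split('.'))
# ['The price rose 3', '5 percent', ' Mr', ' Smith agreed', '']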


1 Answer

By default, spaCy uses its dependency parser to do sentence segmentation, which requires loading a statistical model. The sentencizer is a rule-based sentence segmenter that you can use to define your own sentence segmentation rules without loading a model.
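
The sentencizer route no longer needs create_pipe in recent versions. As a sketch, assuming spaCy 3.x (where components are added to the pipeline by name), it can run on a blank pipeline without downloading any model:

import spacy

nlp = spacy.blank('en')      # blank English pipeline, no statistical model
nlp.add_pipe('sentencizer')  # rule-based sentence boundary detection
doc = nlp('Hello, world. Here are two sentences.')
print([sent.text.strip() for sent in doc.sents])
# ['Hello, world.', 'Here are two sentences.']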

If you don't mind leaving the parser activated, you can use the following code:

import spacy
nlp = spacy.load('en_core_web_sm') # or whatever model you have installed
raw_text = 'Hello, world. Here are two sentences.'
doc = nlp(raw_text)
sentences = [sent.text.strip() for sent in doc.sents]
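
To sanity-check the parser-based segmentation on trickier input, you can feed it the kind of text from the question; abbreviations like "Mr." and decimal numbers should not trigger a split (a quick check, reusing the nlp object loaded above):

doc = nlp("Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome.")
for sent in doc.sents:
    print(sent.text)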