Separate texts into sentences: NLTK vs spaCy

I want to separate texts into sentences.

Looking on Stack Overflow I found:

WITH NLTK

from nltk.tokenize import sent_tokenize  # needs the punkt data: nltk.download('punkt')

text = """Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome. The sky is pinkish-blue. You shouldn't eat cardboard"""
tokenized_text = sent_tokenize(text)
print(tokenized_text)

WITH SPACY

from spacy.lang.en import English

raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
nlp.add_pipe(nlp.create_pipe('sentencizer'))  # spaCy 2.x API; in spaCy 3.x this is nlp.add_pipe('sentencizer')
doc = nlp(raw_text)
sentences = [sent.text.strip() for sent in doc.sents]  # sent.text; sent.string was removed in spaCy 3.x

The question is: what happens in the background that makes spaCy do it differently, with a so-called create_pipe? Sentences are important for training your own word embeddings for NLP. There should be a reason why spaCy does not include a sentence tokenizer directly out of the box.

Thanks.

NOTE: Be aware that a simple .split('.') does not work: the text contains several decimal numbers and other kinds of tokens containing '.'.
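
For illustration, a quick snippet (with a hypothetical sample string) showing how the naive split mangles both decimals and abbreviations:

text = "The price rose 3.5 percent. Mr. Smith agreed."
print(text.split('.'))
# ['The price rose 3', '5 percent', ' Mr', ' Smith agreed', '']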


1 Answer

By default, spaCy uses its dependency parser to do sentence segmentation, which requires loading a statistical model. The sentencizer is a rule-based sentence segmenter that you can use to define your own sentence segmentation rules without loading a model.
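
The sentencizer route no longer needs create_pipe in recent versions. As a sketch, assuming spaCy 3.x (where components are added to the pipeline by name), it can run on a blank pipeline without downloading any model:

import spacy

nlp = spacy.blank('en')      # blank English pipeline, no statistical model
nlp.add_pipe('sentencizer')  # rule-based sentence boundary detection
doc = nlp('Hello, world. Here are two sentences.')
print([sent.text.strip() for sent in doc.sents])
# ['Hello, world.', 'Here are two sentences.']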

If you don't mind leaving the parser activated, you can use the following code:

import spacy
nlp = spacy.load('en_core_web_sm') # or whatever model you have installed
raw_text = 'Hello, world. Here are two sentences.'
doc = nlp(raw_text)
sentences = [sent.text.strip() for sent in doc.sents]
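
To sanity-check the parser-based segmentation on trickier input, you can feed it the kind of text from the question; abbreviations like "Mr." and decimal numbers should not trigger a split (a quick check, reusing the nlp object loaded above):

doc = nlp("Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome.")
for sent in doc.sents:
    print(sent.text)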