I am using NLTK's PunktSentenceTokenizer to tokenize a text into a set of sentences. However, the tokenizer doesn't seem to treat a new paragraph or new lines as a new sentence.
>>> from nltk.tokenize.punkt import PunktSentenceTokenizer
>>> tokenizer = PunktSentenceTokenizer()
>>> tokenizer.tokenize('Sentence 1 \n Sentence 2. Sentence 3.')
['Sentence 1 \n Sentence 2.', 'Sentence 3.']
>>> tokenizer.span_tokenize('Sentence 1 \n Sentence 2. Sentence 3.')
[(0, 24), (25, 36)]
I would like it to treat new lines as sentence boundaries as well. Is there any way to do this (I need to keep the offsets too)?
Sentence tokenization is the process of splitting text into individual sentences. For literature, journalism, and formal documents, the tokenization algorithms built into spaCy perform well, since the tokenizer is trained on a corpus of formal English text.
NLTK provides a tokenize module that covers two sub-tasks: word tokenization, where the word_tokenize() method splits a sentence into tokens or words, and sentence tokenization, where the sent_tokenize() method splits a document or paragraph into sentences.
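For reference, a minimal sketch of the two methods (this assumes the pretrained punkt sentence model has already been fetched with nltk.download()):

>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> sent_tokenize('This is a sentence. This is another sentence.')
['This is a sentence.', 'This is another sentence.']
>>> word_tokenize('This is another sentence.')
['This', 'is', 'another', 'sentence', '.']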
A tokenize() method typically returns a list and ignores empty strings (produced when a delimiter appears twice in succession), whereas split() keeps such strings; splitting on a regex delimiter requires a regex-aware split rather than tokenize().
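As a rough illustration with Python stand-ins (str.split, re.split and NLTK's WhitespaceTokenizer; the exact behaviour depends on which split/tokenize pair is meant):

>>> import re
>>> from nltk.tokenize import WhitespaceTokenizer
>>> 'a  b'.split(' ')                        # keeps the empty string between the two spaces
['a', '', 'b']
>>> WhitespaceTokenizer().tokenize('a  b')   # drops it
['a', 'b']
>>> re.split(r'\s+', 'a  b')                 # regex delimiters need re.split
['a', 'b']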
NLTK tokenization is used for parsing a large amount of textual data into parts in order to analyse the character of the text. Tokenization with NLTK can be used for training machine learning models and for text cleaning in natural language processing.
Well, I had the same problem, and what I did was split the text on '\n'. Something like this:
# in my case, when it had '\n', I called it a new paragraph,
# like a collection of sentences
paragraphs = [p for p in text.split('\n') if p]
# and here, sent_tokenize each one of the paragraphs
for paragraph in paragraphs:
    sentences = tokenizer.tokenize(paragraph)
This is a simplified version of what I had in production, but the general idea is the same. And sorry about the comments and docstring in Portuguese; this was done for 'educational purposes' for a Brazilian audience.
def paragraphs(self):
    if self._paragraphs is not None:
        for p in self._paragraphs:
            yield p
    else:
        raw_paras = self.raw_text.split(self.paragraph_delimiter)
        gen = (Paragraph(self, p) for p in raw_paras if p)
        self._paragraphs = []
        for p in gen:
            self._paragraphs.append(p)
            yield p
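For context, here is a minimal, hypothetical harness showing how that lazily-cached generator could be driven. The Document and Paragraph classes below are stand-ins inferred from the snippet (it assumes the paragraphs() method above was pasted at module level), not the actual classes from the linked project:

class Paragraph:
    def __init__(self, document, raw):
        self.raw = raw

class Document:
    paragraph_delimiter = '\n'

    def __init__(self, raw_text):
        self.raw_text = raw_text
        self._paragraphs = None           # filled lazily by paragraphs()

Document.paragraphs = paragraphs          # attach the generator defined above

doc = Document('Sentence 1. Sentence 2.\nSentence 3.')
print([p.raw for p in doc.paragraphs()])  # first pass builds the cache
print([p.raw for p in doc.paragraphs()])  # second pass reuses self._paragraphs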
Full code: https://gitorious.org/restjor/restjor/source/4d684ea4f18f66b097be1e10cc8814736888dfb4:restjor/decomposition.py
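Since the question also needs the offsets, here is a hedged sketch of the same split-then-tokenize idea that shifts each paragraph's span_tokenize() offsets back into positions in the full text. The helper name sentence_spans and the choice of a single '\n' as the extra boundary are my assumptions, not part of the answer above:

from nltk.tokenize.punkt import PunktSentenceTokenizer

def sentence_spans(text, tokenizer=None):
    # Treat every '\n' as a sentence boundary and yield (start, end)
    # offsets relative to the original, unsplit text.
    tokenizer = tokenizer or PunktSentenceTokenizer()
    offset = 0
    for paragraph in text.split('\n'):
        if paragraph:
            for start, end in tokenizer.span_tokenize(paragraph):
                yield offset + start, offset + end   # shift back to global offsets
        offset += len(paragraph) + 1                 # +1 for the '\n' removed by split()

text = 'Sentence 1 \n Sentence 2. Sentence 3.'
print([(s, e, text[s:e]) for s, e in sentence_spans(text)])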