
NLTK sentence tokenizer: consider new lines as sentence boundaries

I am using NLTK's PunktSentenceTokenizer to tokenize a text into a set of sentences. However, the tokenizer doesn't seem to treat new paragraphs or new lines as sentence boundaries.

>>> from nltk.tokenize.punkt import PunktSentenceTokenizer
>>> tokenizer = PunktSentenceTokenizer()
>>> tokenizer.tokenize('Sentence 1 \n Sentence 2. Sentence 3.')
['Sentence 1 \n Sentence 2.', 'Sentence 3.']
>>> tokenizer.span_tokenize('Sentence 1 \n Sentence 2. Sentence 3.')
[(0, 24), (25, 36)]

I would like it to consider new lines as sentence boundaries as well. Is there any way to do this? (I need to save the offsets too.)

asked Mar 13 '15 at 20:03 by CentAu

People also ask

How does sentence tokenizer work?

Sentence tokenization is the process of splitting text into individual sentences. For literature, journalism, and formal documents, the tokenization algorithms built into spaCy perform well, since the tokenizer is trained on a corpus of formal English text.
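
A minimal sketch of this in spaCy (assuming the small English model has been installed first with "python -m spacy download en_core_web_sm"):

import spacy

# load the small English pipeline; sentence boundaries come from its parser
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is formal text. It splits cleanly into sentences.")
print([sent.text for sent in doc.sents])
# ['This is formal text.', 'It splits cleanly into sentences.']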

How do you Tokenize a sentence using NLTK?

NLTK contains a module called tokenize, which provides (among others) two common functions. Word tokenize: the word_tokenize() method splits a sentence into tokens or words. Sentence tokenize: the sent_tokenize() method splits a document or paragraph into sentences.
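
For example (both functions rely on the Punkt sentence model, which has to be downloaded once):

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download('punkt')  # one-time download of the Punkt sentence model

text = "Sentence 1. Sentence 2."
print(sent_tokenize(text))           # ['Sentence 1.', 'Sentence 2.']
print(word_tokenize("Sentence 1."))  # ['Sentence', '1', '.']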

How is str split () different from Word tokenizer?

A tokenizer such as word_tokenize(), which returns a list, produces no empty strings, whereas split() keeps an empty string when a delimiter appears twice in succession. Note also that str.split() only accepts a literal string as the delimiter; splitting on a pattern requires re.split() or one of NLTK's regexp tokenizers.
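
A quick illustration, using word_tokenize() as the tokenizer and re.split() for pattern-based splitting:

import re
from nltk.tokenize import word_tokenize

print("a,,b".split(","))               # ['a', '', 'b'] (empty string kept)
print(word_tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!'] (no empty strings)
print(re.split(r"[,;]", "a,b;c"))      # ['a', 'b', 'c'] (patterns need re.split)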

What does NLTK tokenizer do?

NLTK tokenization is used to parse large amounts of textual data into parts so that the character of the text can be analyzed. The tokenized output can then be used, for example, to train machine learning models or for Natural Language Processing text cleaning.


1 Answer

Well, I had the same problem, and what I did was split the text on '\n'. Something like this:

from nltk.tokenize.punkt import PunktSentenceTokenizer

tokenizer = PunktSentenceTokenizer()
text = 'Sentence 1 \n Sentence 2. Sentence 3.'

# in my case, when the text had '\n', I called it a new paragraph,
# like a collection of sentences
paragraphs = [p for p in text.split('\n') if p]
# and here, sent_tokenize each one of the paragraphs
for paragraph in paragraphs:
    sentences = tokenizer.tokenize(paragraph)

This is a simplified version of what I had in production, but the general idea is the same. And sorry about the comments and docstrings in Portuguese; this was done for 'educational purposes' for a Brazilian audience.

def paragraphs(self):
    # yield the cached paragraphs if they were already built
    if self._paragraphs is not None:
        for p in self._paragraphs:
            yield p
    else:
        # split the raw text on the paragraph delimiter (e.g. '\n')
        raw_paras = self.raw_text.split(self.paragraph_delimiter)
        gen = (Paragraph(self, p) for p in raw_paras if p)
        # build the cache while yielding each paragraph lazily
        self._paragraphs = []
        for p in gen:
            self._paragraphs.append(p)
            yield p

Full code: https://gitorious.org/restjor/restjor/source/4d684ea4f18f66b097be1e10cc8814736888dfb4:restjor/decomposition.py
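
Note that tokenizing paragraph by paragraph this way discards the global character offsets the question asked for. A minimal sketch, not taken from the linked code, that restores them by tracking where each paragraph starts in the original string (it reuses PunktSentenceTokenizer from the question):

from nltk.tokenize.punkt import PunktSentenceTokenizer

def sentence_spans(text):
    # tokenize each newline-delimited paragraph separately, then shift
    # the paragraph-local spans back into offsets of the full text
    tokenizer = PunktSentenceTokenizer()
    spans = []
    offset = 0
    for paragraph in text.split('\n'):
        if paragraph.strip():  # skip empty lines
            for start, end in tokenizer.span_tokenize(paragraph):
                spans.append((offset + start, offset + end))
        offset += len(paragraph) + 1  # +1 for the removed '\n'
    return spans

print(sentence_spans('Sentence 1 \n Sentence 2. Sentence 3.'))
# roughly [(0, 10), (12, 24), (25, 36)]; exact spans vary by NLTK version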

answered Oct 13 '22 at 01:10 by Juca