
Preserve empty lines with NLTK's Punkt Tokenizer

I'm using NLTK's Punkt sentence tokenizer to split a file into a list of sentences, and I would like to preserve the empty lines within the file:

from nltk import data
tokenizer = data.load('tokenizers/punkt/english.pickle')
s = "That was a very loud beep.\n\n I don't even know\n if this is working. Mark?\n\n Mark are you there?\n\n\n"
sentences = tokenizer.tokenize(s)
print(sentences)

I would like this to print:

['That was a very loud beep.\n\n', "I don't even know\n if this is working.", 'Mark?\n\n', 'Mark are you there?\n\n\n']

But the content that's actually printed shows that the trailing empty lines have been removed from the first and third sentences:

['That was a very loud beep.', "I don't even know\n if this is working.", 'Mark?', 'Mark are you there?\n\n\n']

Other tokenizers in NLTK have a blanklines='keep' parameter, but I don't see any such option in the case of the Punkt tokenizer. It's very possible I'm missing something simple. Is there a way to retain these trailing empty lines using the Punkt sentence tokenizer? I'd be grateful for any insights others can offer!

asked Oct 15 '15 by duhaime



2 Answers

The problem

Sadly, you can't make the tokenizer keep the blank lines, not the way it is written.

Starting from PunktSentenceTokenizer.tokenize() in the NLTK source and following the function calls through span_tokenize() and _slices_from_text(), you can see there is a condition

if match.group('next_tok'):

that is designed to ensure the tokenizer skips whitespace until the next possible sentence-starting token occurs. Looking for the regex this refers to, we end up at _period_context_fmt, where we see that the next_tok named group is preceded by \s+, so the blank lines are never captured as part of a sentence.
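
To see the full pattern this format string produces, you can ask PunktLanguageVars for the compiled regex -- its period_context_re() method substitutes SentEndChars and NonWord into _period_context_fmt and compiles the result:

import nltk.tokenize.punkt as pkt

# Print the compiled period-context pattern; the \s+(?P<next_tok>\S+)
# branch that swallows the whitespace is visible in the output
print(pkt.PunktLanguageVars().period_context_re().pattern)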

The solution

Break it down, change the part that you don't like, reassemble your custom solution.

Now this regex lives in the PunktLanguageVars class, which is itself used to initialize the PunktSentenceTokenizer class. We just have to derive a custom class from PunktLanguageVars and adjust the regex to behave the way we want.

The fix we want is to include trailing newlines at the end of a sentence, so I suggest replacing the _period_context_fmt, going from this:

_period_context_fmt = r"""
    \S*                          # some word material
    %(SentEndChars)s             # a potential sentence ending
    (?=(?P<after_tok>
        %(NonWord)s              # either other punctuation
        |
        \s+(?P<next_tok>\S+)     # or whitespace and some other token
    ))"""

to this:

_period_context_fmt = r"""
    \S*                          # some word material
    %(SentEndChars)s             # a potential sentence ending
    \s*                       #  <-- THIS is what I changed
    (?=(?P<after_tok>
        %(NonWord)s              # either other punctuation
        |
        (?P<next_tok>\S+)     #  <-- Normally you would have \s+ here
    ))"""

Now a tokenizer using this regex instead of the old one will include zero or more \s characters after the end of a sentence.

The whole script

import nltk.tokenize.punkt as pkt

class CustomLanguageVars(pkt.PunktLanguageVars):

    _period_context_fmt = r"""
        \S*                          # some word material
        %(SentEndChars)s             # a potential sentence ending
        \s*                       #  <-- THIS is what I changed
        (?=(?P<after_tok>
            %(NonWord)s              # either other punctuation
            |
            (?P<next_tok>\S+)     #  <-- Normally you would have \s+ here
        ))"""

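# Hand the tokenizer our customized language vars instead of the defaults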
custom_tknzr = pkt.PunktSentenceTokenizer(lang_vars=CustomLanguageVars())

s = "That was a very loud beep.\n\n I don't even know\n if this is working. Mark?\n\n Mark are you there?\n\n\n"

print(custom_tknzr.tokenize(s))

This outputs:

['That was a very loud beep.\n\n ', "I don't even know\n if this is working. ", 'Mark?\n\n ', 'Mark are you there?\n\n\n']
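
Note that the sentences now end in '\n\n ' rather than the '\n\n' the question asked for: the greedy \s* also grabs the single space that precedes the next sentence. If you want exactly the question's expected output, a small post-pass (my addition, not part of the original answer) strips trailing spaces while keeping the newlines:

trimmed = [sent.rstrip(' ') for sent in custom_tknzr.tokenize(s)]
print(trimmed)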
answered Sep 25 '22 by HugoMailhot


Split the input into paragraphs, splitting on a capturing regexp (which returns the captured separators as well):

import re
paras = re.split(r"(\n\s*\n)", s)

You can then apply nltk.sent_tokenize() to the individual paragraphs, and process the results by paragraph or flatten the list -- whatever best suits your further use.

sents_by_para = [ nltk.sent_tokenize(p) for p in paras ]
flat = [ sent for par in sents_by_para for sent in par ]

(It seems that sent_tokenize() doesn't mangle whitespace-only strings, so there's no need to check and exclude them from processing.)

If you specifically want to have the whitespace attached to the previous sentence, you can easily stick it back on:

collapsed = []
for sent in flat:
    # Whitespace-only pieces get glued onto the previous sentence
    if sent.isspace() and len(collapsed) > 0:
        collapsed[-1] += sent
    else:
        collapsed.append(sent)
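
Putting the pieces together on the question's sample string -- a minimal end-to-end sketch (it assumes the punkt model has been downloaded via nltk.download('punkt'); exactly how sent_tokenize distributes leading whitespace can vary between NLTK versions):

import re
import nltk

s = "That was a very loud beep.\n\n I don't even know\n if this is working. Mark?\n\n Mark are you there?\n\n\n"

# The capturing group keeps the blank-line separators in the result list
paras = re.split(r"(\n\s*\n)", s)
sents_by_para = [nltk.sent_tokenize(p) for p in paras]
flat = [sent for par in sents_by_para for sent in par]

# Re-attach each whitespace-only piece to the sentence before it
collapsed = []
for sent in flat:
    if sent.isspace() and collapsed:
        collapsed[-1] += sent
    else:
        collapsed.append(sent)

print(collapsed)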
answered Sep 22 '22 by alexis