
Preserve empty lines with NLTK's Punkt Tokenizer

I'm using NLTK's Punkt sentence tokenizer to split a file into a list of sentences, and I would like to preserve the empty lines within the file:

from nltk import data
tokenizer = data.load('tokenizers/punkt/english.pickle')
s = "That was a very loud beep.\n\n I don't even know\n if this is working. Mark?\n\n Mark are you there?\n\n\n"
sentences = tokenizer.tokenize(s)
print(sentences)

I would like this to print:

['That was a very loud beep.\n\n', "I don't even know\n if this is working.", 'Mark?\n\n', 'Mark are you there?\n\n\n']

But the content that's actually printed shows that the trailing empty lines have been removed from the first and third sentences:

['That was a very loud beep.', "I don't even know\n if this is working.", 'Mark?', 'Mark are you there?\n\n\n']

Other tokenizers in NLTK have a blanklines='keep' parameter, but I don't see any such option in the case of the Punkt tokenizer. It's very possible I'm missing something simple. Is there a way to retain these trailing empty lines using the Punkt sentence tokenizer? I'd be grateful for any insights others can offer!

asked Oct 15 '15 by duhaime



2 Answers

The problem

Sadly, you can't make the tokenizer keep the blank lines, not the way it is written.

Starting from PunktSentenceTokenizer.tokenize() in the NLTK source and following the function calls through span_tokenize() and _slices_from_text(), you can see there is a condition

if match.group('next_tok'):

that is designed to ensure the tokenizer skips whitespace until the next possible sentence-starting token occurs. Looking for the regex this refers to, we end up at _period_context_fmt, where we see that the next_tok named group is preceded by \s+, so the blank lines are never captured as part of a sentence.
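
To see the full pattern this format string produces, you can ask PunktLanguageVars for the compiled regex -- its period_context_re() method substitutes SentEndChars and NonWord into _period_context_fmt and compiles the result:

import nltk.tokenize.punkt as pkt

# Print the compiled period-context pattern; the \s+(?P<next_tok>\S+)
# branch that swallows the whitespace is visible in the output
print(pkt.PunktLanguageVars().period_context_re().pattern)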

The solution

Break it down, change the part that you don't like, reassemble your custom solution.

Now this regex lives in the PunktLanguageVars class, which is itself used to initialize the PunktSentenceTokenizer class. We just have to derive a custom class from PunktLanguageVars and adjust the regex to behave the way we want.

The fix we want is to include trailing newlines at the end of a sentence, so I suggest replacing the _period_context_fmt, going from this:

_period_context_fmt = r"""
    \S*                          # some word material
    %(SentEndChars)s             # a potential sentence ending
    (?=(?P<after_tok>
        %(NonWord)s              # either other punctuation
        |
        \s+(?P<next_tok>\S+)     # or whitespace and some other token
    ))"""

to this:

_period_context_fmt = r"""
    \S*                          # some word material
    %(SentEndChars)s             # a potential sentence ending
    \s*                       #  <-- THIS is what I changed
    (?=(?P<after_tok>
        %(NonWord)s              # either other punctuation
        |
        (?P<next_tok>\S+)     #  <-- Normally you would have \s+ here
    ))"""

Now a tokenizer using this regex instead of the old one will include zero or more \s characters after the end of a sentence.

The whole script

import nltk.tokenize.punkt as pkt

class CustomLanguageVars(pkt.PunktLanguageVars):

    _period_context_fmt = r"""
        \S*                          # some word material
        %(SentEndChars)s             # a potential sentence ending
        \s*                       #  <-- THIS is what I changed
        (?=(?P<after_tok>
            %(NonWord)s              # either other punctuation
            |
            (?P<next_tok>\S+)     #  <-- Normally you would have \s+ here
        ))"""

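# Hand the tokenizer our customized language vars instead of the defaults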
custom_tknzr = pkt.PunktSentenceTokenizer(lang_vars=CustomLanguageVars())

s = "That was a very loud beep.\n\n I don't even know\n if this is working. Mark?\n\n Mark are you there?\n\n\n"

print(custom_tknzr.tokenize(s))

This outputs:

['That was a very loud beep.\n\n ', "I don't even know\n if this is working. ", 'Mark?\n\n ', 'Mark are you there?\n\n\n']
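
Note that the sentences now end in '\n\n ' rather than the '\n\n' the question asked for: the greedy \s* also grabs the single space that precedes the next sentence. If you want exactly the question's expected output, a small post-pass (my addition, not part of the original answer) strips trailing spaces while keeping the newlines:

trimmed = [sent.rstrip(' ') for sent in custom_tknzr.tokenize(s)]
print(trimmed)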
answered Sep 25 '22 by HugoMailhot


Split the input into paragraphs, splitting on a capturing regexp (which returns the captured separators as well):

import re
paras = re.split(r"(\n\s*\n)", s)

You can then apply nltk.sent_tokenize() to the individual paragraphs, and process the results by paragraph or flatten the list -- whatever best suits your further use.

sents_by_para = [ nltk.sent_tokenize(p) for p in paras ]
flat = [ sent for par in sents_by_para for sent in par ]

(It seems that sent_tokenize() doesn't mangle whitespace-only strings, so there's no need to check and exclude them from processing.)

If you specifically want to have the whitespace attached to the previous sentence, you can easily stick it back on:

collapsed = []
for sent in flat:
    # Whitespace-only pieces get glued onto the previous sentence
    if sent.isspace() and len(collapsed) > 0:
        collapsed[-1] += sent
    else:
        collapsed.append(sent)
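
Putting the pieces together on the question's sample string -- a minimal end-to-end sketch (it assumes the punkt model has been downloaded via nltk.download('punkt'); exactly how sent_tokenize distributes leading whitespace can vary between NLTK versions):

import re
import nltk

s = "That was a very loud beep.\n\n I don't even know\n if this is working. Mark?\n\n Mark are you there?\n\n\n"

# The capturing group keeps the blank-line separators in the result list
paras = re.split(r"(\n\s*\n)", s)
sents_by_para = [nltk.sent_tokenize(p) for p in paras]
flat = [sent for par in sents_by_para for sent in par]

# Re-attach each whitespace-only piece to the sentence before it
collapsed = []
for sent in flat:
    if sent.isspace() and collapsed:
        collapsed[-1] += sent
    else:
        collapsed.append(sent)

print(collapsed)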
answered Sep 22 '22 by alexis