
NLTK sentence tokenizer: consider new lines as sentence boundaries

I am using NLTK's PunktSentenceTokenizer to tokenize a text into a set of sentences. However, the tokenizer doesn't seem to treat new paragraphs or new lines as sentence boundaries.

>>> from nltk.tokenize.punkt import PunktSentenceTokenizer
>>> tokenizer = PunktSentenceTokenizer()
>>> tokenizer.tokenize('Sentence 1 \n Sentence 2. Sentence 3.')
['Sentence 1 \n Sentence 2.', 'Sentence 3.']
>>> tokenizer.span_tokenize('Sentence 1 \n Sentence 2. Sentence 3.')
[(0, 24), (25, 36)]

I would like it to consider new lines as sentence boundaries as well. Is there any way to do this? (I need to save the offsets too.)

asked Mar 13 '15 at 20:03 by CentAu

People also ask

How does sentence tokenizer work?

Sentence tokenization is the process of splitting text into individual sentences. For literature, journalism, and formal documents, the tokenization algorithms built into spaCy perform well, since the tokenizer is trained on a corpus of formal English text.
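
A minimal sketch of this in spaCy (assuming the small English model has been installed first with "python -m spacy download en_core_web_sm"):

import spacy

# load the small English pipeline; sentence boundaries come from its parser
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is formal text. It splits cleanly into sentences.")
print([sent.text for sent in doc.sents])
# ['This is formal text.', 'It splits cleanly into sentences.']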

How do you Tokenize a sentence using NLTK?

NLTK contains a module called tokenize, which provides (among others) two common functions. Word tokenize: the word_tokenize() method splits a sentence into tokens or words. Sentence tokenize: the sent_tokenize() method splits a document or paragraph into sentences.
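
For example (both functions rely on the Punkt sentence model, which has to be downloaded once):

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download('punkt')  # one-time download of the Punkt sentence model

text = "Sentence 1. Sentence 2."
print(sent_tokenize(text))           # ['Sentence 1.', 'Sentence 2.']
print(word_tokenize("Sentence 1."))  # ['Sentence', '1', '.']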

How is str split () different from Word tokenizer?

A tokenizer such as word_tokenize(), which returns a list, produces no empty strings, whereas split() keeps an empty string when a delimiter appears twice in succession. Note also that str.split() only accepts a literal string as the delimiter; splitting on a pattern requires re.split() or one of NLTK's regexp tokenizers.
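
A quick illustration, using word_tokenize() as the tokenizer and re.split() for pattern-based splitting:

import re
from nltk.tokenize import word_tokenize

print("a,,b".split(","))               # ['a', '', 'b'] (empty string kept)
print(word_tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!'] (no empty strings)
print(re.split(r"[,;]", "a,b;c"))      # ['a', 'b', 'c'] (patterns need re.split)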

What does NLTK tokenizer do?

NLTK tokenization is used to parse large amounts of textual data into parts so that the character of the text can be analyzed. The tokenized output can then be used, for example, to train machine learning models or for Natural Language Processing text cleaning.


1 Answer

Well, I had the same problem, and what I did was split the text on '\n'. Something like this:

from nltk.tokenize.punkt import PunktSentenceTokenizer

tokenizer = PunktSentenceTokenizer()
text = 'Sentence 1 \n Sentence 2. Sentence 3.'

# in my case, when the text had '\n', I called it a new paragraph,
# like a collection of sentences
paragraphs = [p for p in text.split('\n') if p]
# and here, sent_tokenize each one of the paragraphs
for paragraph in paragraphs:
    sentences = tokenizer.tokenize(paragraph)

This is a simplified version of what I had in production, but the general idea is the same. And sorry about the comments and docstrings in Portuguese; this was done for 'educational purposes' for a Brazilian audience.

def paragraphs(self):
    # yield the cached paragraphs if they were already built
    if self._paragraphs is not None:
        for p in self._paragraphs:
            yield p
    else:
        # split the raw text on the paragraph delimiter (e.g. '\n')
        raw_paras = self.raw_text.split(self.paragraph_delimiter)
        gen = (Paragraph(self, p) for p in raw_paras if p)
        # build the cache while yielding each paragraph lazily
        self._paragraphs = []
        for p in gen:
            self._paragraphs.append(p)
            yield p

Full code: https://gitorious.org/restjor/restjor/source/4d684ea4f18f66b097be1e10cc8814736888dfb4:restjor/decomposition.py
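
Note that tokenizing paragraph by paragraph this way discards the global character offsets the question asked for. A minimal sketch, not taken from the linked code, that restores them by tracking where each paragraph starts in the original string (it reuses PunktSentenceTokenizer from the question):

from nltk.tokenize.punkt import PunktSentenceTokenizer

def sentence_spans(text):
    # tokenize each newline-delimited paragraph separately, then shift
    # the paragraph-local spans back into offsets of the full text
    tokenizer = PunktSentenceTokenizer()
    spans = []
    offset = 0
    for paragraph in text.split('\n'):
        if paragraph.strip():  # skip empty lines
            for start, end in tokenizer.span_tokenize(paragraph):
                spans.append((offset + start, offset + end))
        offset += len(paragraph) + 1  # +1 for the removed '\n'
    return spans

print(sentence_spans('Sentence 1 \n Sentence 2. Sentence 3.'))
# roughly [(0, 10), (12, 24), (25, 36)]; exact spans vary by NLTK version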

answered Oct 13 '22 at 01:10 by Juca