I wish to split text into sentences. Can anyone help me?
I also need to handle abbreviations, but my plan is to replace those at an earlier stage, e.g. Mr. -> Mister.
import re
import unittest


class Sentences:
    def __init__(self, text):
        # Split on whitespace that follows ., ! or ?. The lookbehind
        # keeps the punctuation attached to each sentence, which the
        # original pattern "[.!?]\s" discarded.
        self.sentences = tuple(re.split(r"(?<=[.!?])\s+", text))


class TestSentences(unittest.TestCase):
    def testFullStop(self):
        self.assertEqual(Sentences("X. X.").sentences, ("X.", "X."))

    def testQuestion(self):
        self.assertEqual(Sentences("X? X?").sentences, ("X?", "X?"))

    def testExclamation(self):
        self.assertEqual(Sentences("X! X!").sentences, ("X!", "X!"))

    def testMixed(self):
        self.assertEqual(Sentences("X! X? X! X.").sentences, ("X!", "X?", "X!", "X."))
Thanks, Barry
EDIT: To start with, I would be happy to satisfy the four tests I've included above. This would help me better understand how regexes work. For now I can define a sentence as "X." etc., as in my tests.
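Combining the two ideas in the question, here is a minimal sketch: expand abbreviations first (the Mr. -> Mister replacement mentioned above), then split with a lookbehind so the punctuation stays attached. The `ABBREVIATIONS` table and `split_sentences` helper are hypothetical names chosen for illustration:

```python
import re

# Hypothetical abbreviation table for the pre-replacement step
# described in the question (Mr. -> Mister).
ABBREVIATIONS = {"Mr.": "Mister", "Dr.": "Doctor"}


def split_sentences(text):
    # Expand known dotted abbreviations first so their dots do not
    # end sentences prematurely.
    for abbrev, expansion in ABBREVIATIONS.items():
        text = text.replace(abbrev, expansion)
    # Split on whitespace preceded by ., ! or ?; the lookbehind keeps
    # the punctuation as part of each sentence.
    return tuple(re.split(r"(?<=[.!?])\s+", text))


print(split_sentences("Mr. Smith left! Did he return? No."))
# -> ('Mister Smith left!', 'Did he return?', 'No.')
```

This passes the four tests above, but a plain string replacement is only a stopgap: it cannot tell "Mr." the abbreviation from a sentence that genuinely ends in those letters.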
Sentence segmentation can be a very difficult task, especially when the text contains dotted abbreviations. It may require lists of known abbreviations, or training a classifier to recognize them.
I suggest you use NLTK - it is a suite of open-source Python modules designed for natural language processing.
You can read about sentence segmentation using NLTK here, and decide for yourself whether this tool fits your needs.
EDIT: or, even simpler, here, and here is the source code. This is the Punkt sentence tokenizer, included in NLTK.