My program takes a text file and splits it into a list of sentences using split('.'), meaning it splits whenever it encounters a full stop. However, this can be inaccurate:
```python
text = 'i love carpets. In fact i own 2.4 km of the stuff.'

# What I get:
listOfSentences = ['i love carpets', 'in fact i own 2', '4 km of the stuff']

# What I want:
listOfSentences = ['i love carpets', 'in fact i own 2.4 km of the stuff']
```
My question is: how do I split at the ends of sentences, and not at every full stop?
A regex-based approach cannot handle cases like "I saw Mr. Smith.", and adding hacks for each such case does not scale. As user est has commented, any serious implementation uses data.
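To see why, here is a quick illustration: a naive regex that splits wherever sentence-ending punctuation is followed by whitespace already breaks on the abbreviation (the pattern below is just a demonstration of the failure, not a proposed fix):

```python
import re

text = "I saw Mr. Smith. He waved."

# Split wherever '.', '!' or '?' is followed by whitespace.
parts = re.split(r'(?<=[.!?])\s+', text)

print(parts)  # ['I saw Mr.', 'Smith.', 'He waved.'] -- "Mr. Smith" is wrongly split
```

You would then need a special case for "Mr.", then "Dr.", then "e.g.", and so on without end, which is exactly the hack-stacking the answer warns against.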
If you only need to handle English, then spaCy is a better choice than NLTK:
```python
import spacy

# Requires the English model:  python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

doc = nlp('i love carpets. In fact i own 2.4 km of the stuff.')
for sent in doc.sents:
    print(sent.text)
```
Update: spaCy now supports many languages.