
Accurately splitting sentences

My program takes a text file and splits it into a list of sentences using split('.'), so it splits wherever it finds a full stop. However, this can be inaccurate.

For Example

text = 'i love carpets. In fact i own 2.4 km of the stuff.'

Output

listOfSentences = ['i love carpets', 'In fact i own 2', '4 km of the stuff']

Desired Output

listOfSentences = ['i love carpets', 'In fact i own 2.4 km of the stuff']

My question is: how do I split at the end of each sentence rather than at every full stop?

Marko asked Nov 27 '22 15:11



1 Answer

No regex-based approach can reliably handle cases like "I saw Mr. Smith.", and adding hacks for each such case does not scale. As user est has commented, any serious implementation uses data.
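
To illustrate the limitation, here is a minimal sketch of a regex heuristic that splits only after a full stop, exclamation mark, or question mark followed by whitespace. The helper name split_sentences is hypothetical and not part of any library. It handles the "2.4 km" example from the question, but it still breaks on abbreviations:

```python
import re

def split_sentences(text):
    """Naive heuristic: split after '.', '!' or '?' when whitespace follows."""
    # The lookbehind keeps the punctuation attached to its sentence.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

# The decimal point in "2.4" is not followed by whitespace, so it is kept.
print(split_sentences('i love carpets. In fact i own 2.4 km of the stuff.'))
# ['i love carpets.', 'In fact i own 2.4 km of the stuff.']

# The abbreviation "Mr." is followed by whitespace, so it is split wrongly.
print(split_sentences('I saw Mr. Smith.'))
# ['I saw Mr.', 'Smith.']
```

This may be enough for clean input with no abbreviations, but every new exception ("Dr.", "e.g.", initials, and so on) needs another special case.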

If you only need to handle English, spaCy is better than NLTK:

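import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("i love carpets. In fact i own 2.4 km of the stuff.")
# doc.sents yields one Span per detected sentence
for sent in doc.sents:
    print(sent.text)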

Update: spaCy now supports many languages.

Adam Bittlingmayer answered Dec 18 '22 13:12
