I know this might sound easy. I thought about treating the first dot (.) as the end of the first sentence, but that breaks down as soon as abbreviations and short forms appear.
For example:
Sir Winston Leonard Spencer-Churchill, KG, OM, CH, TD, PC, DL, FRS, Hon. RA (30 November 1874 – 24 January 1965) was a British politician and statesman known for his leadership of the United Kingdom during the Second World War. He is widely regarded as one of the great wartime leaders and served as Prime Minister twice. A noted statesman and orator, Churchill was also an officer in the British Army, a historian, a writer, and an artist.
Here, the first dot is the one in "Hon.", but I want the complete first sentence, ending at "Second World War."
Is this possible?
If you use NLTK, you can add abbreviations to its Punkt sentence tokenizer, like this:
>>> import nltk
>>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
>>> sent_detector._params.abbrev_types.add('hon')
>>> sent_detector.tokenize(your_text)
['Sir Winston Leonard Spencer-Churchill, KG, OM, CH, TD, PC, DL, FRS, Hon. RA
(30 November 1874 \xe2\x80\x93 24 January 1965) was a British politician and
statesman known for his leadership of the United Kingdom during the Second
World War.',
'He is widely regarded as one of the great wartime leaders and served as Prime
Minister twice.',
'A noted statesman and orator, Churchill was also an officer in the British Army,
a historian, a writer, and an artist.']
This approach is based on Kiss & Strunk 2006, which reports that the F-score (harmonic mean of precision and recall) is between 91% and 99% for Punkt, depending on the test corpus.
Kiss, Tibor, and Jan Strunk. 2006. "Unsupervised Multilingual Sentence Boundary Detection". Computational Linguistics 32(4): 485-525.
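If you cannot install NLTK, the "first dot, except after known abbreviations" idea from the question can be sketched with the standard library alone. This is not Punkt and is far less robust; it is a rough illustration written for Python 3. The abbreviation list and the function name first_sentence are made up for this example, and the rule "a sentence ends at . ! or ? followed by whitespace and a capital letter" is an assumption that will fail on some texts.

```python
import re

# Hypothetical abbreviation list for illustration; extend it for your own text.
ABBREVIATIONS = {"hon", "mr", "mrs", "dr", "prof", "st"}

# Candidate boundary: '.', '!' or '?' followed by whitespace and a capital
# letter, or by the end of the text.
BOUNDARY = re.compile(r"[.!?](?=\s+[A-Z]|\s*$)")


def first_sentence(text, abbreviations=ABBREVIATIONS):
    """Return the first sentence of text, skipping dots after known abbreviations."""
    for match in BOUNDARY.finditer(text):
        # Find the word that sits directly before the punctuation mark.
        before = re.search(r"(\w+)$", text[:match.start()])
        last_word = before.group(1).lower() if before else ""
        if match.group() == "." and last_word in abbreviations:
            continue  # the dot belongs to an abbreviation such as "Hon."
        return text[:match.end()].strip()
    # No boundary found: treat the whole text as one sentence.
    return text.strip()


text = (
    "Sir Winston Leonard Spencer-Churchill, KG, OM, CH, TD, PC, DL, FRS, "
    "Hon. RA (30 November 1874 - 24 January 1965) was a British politician "
    "and statesman known for his leadership of the United Kingdom during "
    "the Second World War. He is widely regarded as one of the great "
    "wartime leaders and served as Prime Minister twice."
)
print(first_sentence(text))  # the printed sentence ends at "Second World War."
```

The trade-off is that every abbreviation has to be listed by hand, whereas Punkt learns most of them from a corpus and only needs the occasional manual addition such as 'hon'.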