How to get the first sentence from the following paragraph?

I know this might sound easy. My first thought was to cut the text at the first period (.), but that breaks as soon as the text contains abbreviations or short forms.

For example:

Sir Winston Leonard Spencer-Churchill, KG, OM, CH, TD, PC, DL, FRS, Hon. RA (30 November 1874 – 24 January 1965) was a British politician and statesman known for his leadership of the United Kingdom during the Second World War. He is widely regarded as one of the great wartime leaders and served as Prime Minister twice. A noted statesman and orator, Churchill was also an officer in the British Army, a historian, a writer, and an artist.

Here, the first period comes after "Hon.", but I want the complete first sentence, ending at "Second World War."

Is this possible?

asked Jan 16 '23 09:01 by sammyiitkgp


1 Answer

If you use NLTK, you can add abbreviations to the Punkt sentence tokenizer so that a period after them is not treated as a sentence boundary:

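>>> import nltk
>>> # Load the pre-trained English Punkt model (requires the NLTK 'punkt' data).
>>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
>>> # Register 'hon' (lowercase, without the trailing period) as an abbreviation.
>>> sent_detector._params.abbrev_types.add('hon')
>>> # your_text holds the paragraph from the question.
>>> sent_detector.tokenize(your_text)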
['Sir Winston Leonard Spencer-Churchill, KG, OM, CH, TD, PC, DL, FRS, Hon. RA 
(30 November 1874 – 24 January 1965) was a British politician and 
statesman known for his leadership of the United Kingdom during the Second 
World War.', 
'He is widely regarded as one of the great wartime leaders and served as Prime 
Minister twice.', 
'A noted statesman and orator, Churchill was also an officer in the British Army,
a historian, a writer, and an artist.']

This approach is based on Kiss & Strunk 2006, which reports that the F-score (harmonic mean of precision and recall) is between 91% and 99% for Punkt, depending on the test corpus.

Kiss, Tibor, and Jan Strunk. 2006. "Unsupervised Multilingual Sentence Boundary Detection". Computational Linguistics 32(4): 485-525.
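
If you cannot install NLTK, the underlying idea (do not split after a known abbreviation) can be roughly approximated with the standard library alone. The following is only a minimal sketch, not Punkt: the `first_sentence` helper and the `ABBREVIATIONS` set are hypothetical names made up for this example, the abbreviation list is hand-picked rather than learned, and the boundary rule (terminator followed by whitespace and an uppercase ASCII letter) is an assumption that will miss many real-world cases.

```python
import re

# Hypothetical, hand-picked abbreviations (lowercase, no trailing period).
# Punkt learns these from data; here they must be maintained by hand.
ABBREVIATIONS = {"hon", "mr", "mrs", "dr", "st", "e.g", "i.e"}

# Candidate boundary: '.', '!' or '?' followed by whitespace and an
# uppercase letter, or by the end of the text.
_BOUNDARY = re.compile(r"[.!?](?=\s+[A-Z]|\s*$)")


def first_sentence(text, abbreviations=ABBREVIATIONS):
    """Return the first sentence of text, skipping periods after abbreviations."""
    for match in _BOUNDARY.finditer(text):
        # Word immediately before the punctuation mark, normalised for lookup.
        words = text[:match.start()].split()
        last_word = words[-1].lower().lstrip("([\"'") if words else ""
        if match.group() == "." and last_word in abbreviations:
            continue  # e.g. "Hon." is an abbreviation, not a sentence end
        return text[:match.end()].strip()
    # No boundary found: treat the whole text as one sentence.
    return text.strip()


text = (
    "Sir Winston Leonard Spencer-Churchill, KG, OM, CH, TD, PC, DL, FRS, "
    "Hon. RA (30 November 1874 - 24 January 1965) was a British politician "
    "and statesman known for his leadership of the United Kingdom during "
    "the Second World War. He is widely regarded as one of the great "
    "wartime leaders and served as Prime Minister twice."
)
print(first_sentence(text))
```

For the paragraph in the question this stops at "Second World War." because "hon" is in the list. Any abbreviation missing from the list will still cause an early cut, which is the main reason to prefer a trained tokenizer such as Punkt for real text.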

answered Jan 19 '23 00:01 by fraxel