Finding exact position of tokenized sentences

I want to split text into sentences, but I also need the exact position of each sentence in the original text. The current implementation of tokenize.sent_tokenize in NLTK doesn't return the positions of the extracted sentences, so I tried something like this:

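from nltk import tokenize

def sentence_positions(text):
    offset, length = 0, 0
    for sentence in tokenize.sent_tokenize(text):
        length = len(sentence)
        yield sentence, offset, length
        offset += length  # ignores any characters skipped between sentences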

But it doesn't return the exact position of each sentence, because sent_tokenize drops some characters (e.g. newlines and extra spaces) that fall outside the resulting sentence boundaries, so the running offset drifts. I know the problem is trivial with a simple regex pattern for splitting sentences, but I don't want to use one.

Thanks.

nournia asked Dec 27 '22 09:12


1 Answer

You could use PunktSentenceTokenizer directly (it is used to implement sent_tokenize()):

from nltk.tokenize.punkt import PunktSentenceTokenizer

text = 'Rabbit say to itself "Oh dear! Oh dear! I shall be too late!"'
for start, end in PunktSentenceTokenizer().span_tokenize(text):
    length = end - start
    print(text[start:end], start, length)

In Python 2 you could use buffer(text, start, end - start) instead of text[start:end] to avoid copying each sentence; buffer() was removed in Python 3.
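If you would rather keep calling sent_tokenize(), the offsets can also be recovered afterwards by searching for each sentence from the end of the previous match. The following is only a sketch: locate_sentences is a hypothetical helper name, not part of NLTK, and it assumes every sentence returned by the tokenizer is a verbatim substring of the text. The sentences list here is written by hand as a stand-in for tokenizer output, so the snippet runs without NLTK.

```python
def locate_sentences(text, sentences):
    """Yield (sentence, offset, length) for each sentence, in order.

    Hypothetical helper: assumes each sentence appears verbatim in text.
    """
    cursor = 0
    for sentence in sentences:
        # Search from the end of the previous match, so a repeated
        # sentence maps to its next occurrence rather than the first one.
        offset = text.find(sentence, cursor)
        if offset == -1:
            raise ValueError("sentence not found verbatim: %r" % sentence)
        yield sentence, offset, len(sentence)
        cursor = offset + len(sentence)


text = "Oh dear!  Oh dear!\nI shall be too late!"
# Hand-written stand-in for the output of tokenize.sent_tokenize(text).
sentences = ["Oh dear!", "Oh dear!", "I shall be too late!"]
result = list(locate_sentences(text, sentences))
for sentence, offset, length in result:
    print(sentence, offset, length)
```

Compared with span_tokenize() this does an extra search per sentence, but it works with any tokenizer that returns unmodified slices of the input.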

jfs answered Jan 10 '23 00:01