Finding exact position of tokenized sentences

Question

I want to extract sentences from a text, but I need the exact position of each result. The current implementation of tokenize.sent_tokenize in NLTK doesn't return the positions of the extracted sentences, so I tried something like this:

offset, length = 0, 0
for sentence in tokenize.sent_tokenize(text):
    length = len(sentence)
    yield sentence, offset, length
    offset += length

But it doesn't return the exact positions of the sentences, because sent_tokenize drops some characters (e.g. newlines, extra spaces, ...) that fall outside the resulting sentence boundaries. I know the problem is trivial with a simple regex pattern for splitting sentences, but I don't want to use one.

Thanks.

jfs · Accepted Answer

You could use PunktSentenceTokenizer directly (it is used to implement sent_tokenize()):

from nltk.tokenize.punkt import PunktSentenceTokenizer

text = 'Rabbit say to itself "Oh dear! Oh dear! I shall be too late!"'
for start, end in PunktSentenceTokenizer().span_tokenize(text):
    length = end - start
    # text[start:end] copies the sentence out of the original text
    print(text[start:end], start, length)

On Python 2 you could use buffer(text, start, end - start) instead of text[start:end] to avoid copying each sentence; buffer() is not available in Python 3.
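If you already have the sentence strings and only need their offsets, a fallback is to search for each sentence in the original text, starting from where the previous one ended. This is a different technique from span_tokenize, sketched here under the assumption that every sentence appears verbatim and in order in the text. The name sentence_offsets and the hand-written sentence list are hypothetical and only for illustration; no NLTK call is made.

```python
def sentence_offsets(text, sentences):
    """Yield (sentence, offset, length) for sentences found in order in text."""
    position = 0
    for sentence in sentences:
        # Search forward from the end of the previous match, so a repeated
        # sentence maps to its next occurrence rather than the first one.
        offset = text.find(sentence, position)
        if offset == -1:
            raise ValueError("sentence not found verbatim: %r" % sentence)
        yield sentence, offset, len(sentence)
        position = offset + len(sentence)


text = "Oh dear!  Oh dear!\nI shall be too late!"
# Hypothetical tokenizer output, written by hand for illustration.
sentences = ["Oh dear!", "Oh dear!", "I shall be too late!"]
results = list(sentence_offsets(text, sentences))
for sentence, offset, length in results:
    print(sentence, offset, length)
```

The skipped whitespace between sentences is absorbed by the forward search, which is exactly what the running-total approach in the question cannot account for.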

Tags: python, tokenize, nltk

nournia
