I want to extract sentences from a text, but I need the exact position of each result. The current implementation of tokenize.sent_tokenize
in NLTK doesn't return the positions of the extracted sentences, so I tried something like this:
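from nltk import tokenize

def sentences_with_offsets(text):
    offset, length = 0, 0
    for sentence in tokenize.sent_tokenize(text):
        length = len(sentence)
        yield sentence, offset, length
        # offset only advances by the sentence length, so it drifts
        offset += length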
But it doesn't return the exact positions of the sentences, because sent_tokenize
drops some characters (e.g. newlines and extra spaces) that fall outside the resulting sentence boundaries. I know the problem is trivial if I split sentences with a simple regex pattern, but I don't want to do that.
Thanks.
You could use PunktSentenceTokenizer directly (it is used to implement sent_tokenize()):
from nltk.tokenize.punkt import PunktSentenceTokenizer

text = 'Rabbit say to itself "Oh dear! Oh dear! I shall be too late!"'
# span_tokenize() yields (start, end) character offsets into text
for start, end in PunktSentenceTokenizer().span_tokenize(text):
    length = end - start
    print(text[start:end], start, length)

text[start:end] copies each sentence. On Python 2 you could use buffer(text, start, end - start)
instead if you want to avoid the copy; buffer() is not available in Python 3.
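If you already have the list of sentences and only need their positions, another option is to search for each sentence in the original text, starting from where the previous one ended. The sketch below is a minimal illustration and not part of NLTK: locate_sentences is a hypothetical helper, and it assumes every sentence occurs verbatim in the text and in order (which should hold when the tokenizer returns unmodified slices of its input). The hard-coded sentence list stands in for the output of tokenize.sent_tokenize(text) so that the example runs without NLTK.

```python
def locate_sentences(text, sentences):
    """Yield (sentence, start, length) for each sentence found in text.

    Hypothetical helper: assumes each sentence appears verbatim in text,
    in the same order as in the sentences iterable.
    """
    cursor = 0  # search resumes where the previous sentence ended
    for sentence in sentences:
        start = text.find(sentence, cursor)
        if start == -1:
            raise ValueError("sentence not found in text: %r" % sentence)
        yield sentence, start, len(sentence)
        cursor = start + len(sentence)


text = "First sentence.  \n Second one!\n\nThird?"
# Stand-in for tokenize.sent_tokenize(text), so the sketch runs without NLTK.
sentences = ["First sentence.", "Second one!", "Third?"]
for sentence, start, length in locate_sentences(text, sentences):
    print(repr(text[start:start + length]), start, length)
```

Searching from the running cursor rather than from the start of the text keeps repeated sentences (such as the two "Oh dear!" in the example above) from all mapping to the first occurrence.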