I am using nltk to split a text into sentences. However, I need any sentence that contains a quotation to be extracted as a single unit. Right now each sentence, even one inside a quotation, is extracted as a separate part.
This is an example of something that I am trying to extract as a single unit:
"This is a sentence. This is also a sentence," said the cat.
Right now I have this code:
import nltk.data
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
text = '"This is a sentence. This is also a sentence," said the cat.'
print('\n-----\n'.join(tokenizer.tokenize(text, realign_boundaries=True)))
This works pretty well, but I want to keep a sentence with a quotation in it as one unit, even when the quotation itself contains multiple sentences.
The code above produces:
"This is a sentence.
-----
This is also a sentence," said the cat.
I am trying to get that whole text extracted as a single unit:
"This is a sentence. This is also a sentence," said the cat.
Is there an easy way to do this with nltk or should I use regex instead? I was impressed with how easy it was to get started with nltk, but am stuck now.
If I understand the problem correctly, then this regex should do it:
import re
text = '"This is a sentence. This is also a sentence," said the cat.'
for grp in re.findall(r'"[^"]*\."|("[^"]*")*([^".]*\.)', text):
    print("".join(grp))
It's a combination of two patterns OR'd together. The first finds a quotation that ends with a period just inside the closing quote. The second finds an ordinary sentence, optionally preceded by one or more quotations, that ends with a period. If you have more complex sentences it may need some further adjusting.
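One thing to watch: the pattern contains capture groups, so re.findall returns tuples of groups rather than the whole match, and a hit on the first alternative comes back as empty strings. Here is a rough sketch that uses re.finditer and group(0) instead. The names split_keeping_quotes and merge_open_quotes are made up for illustration. The second function is a different technique, not part of the regex above: it assumes you already have a list of sentences (for example from the nltk tokenizer) and simply glues them together while a straight double quote is still open. It does not handle curly quotes or nested quotations.

```python
import re

# Same pattern as in the answer above, compiled once.
PATTERN = re.compile(r'"[^"]*\."|("[^"]*")*([^".]*\.)')


def split_keeping_quotes(text):
    """Return each full regex match, with surrounding whitespace stripped.

    group(0) is the whole match, so matches from either alternative
    are returned intact (unlike joining the groups from re.findall).
    """
    return [m.group(0).strip() for m in PATTERN.finditer(text)]


def merge_open_quotes(sentences):
    """Join consecutive sentences while a double quote is left open.

    Hypothetical helper: assumes straight double quotes (") only and
    treats an odd number of them as "still inside a quotation".
    """
    merged = []
    buffer = ""
    for sentence in sentences:
        buffer = buffer + " " + sentence if buffer else sentence
        if buffer.count('"') % 2 == 0:
            merged.append(buffer)
            buffer = ""
    if buffer:
        # Unbalanced quote at the end of input: keep what we have.
        merged.append(buffer)
    return merged


if __name__ == "__main__":
    text = '"This is a sentence. This is also a sentence," said the cat.'
    print(split_keeping_quotes(text))
    print(merge_open_quotes(['"This is a sentence.',
                             'This is also a sentence," said the cat.']))
```

If you want to stay with nltk, you could pass the result of tokenizer.tokenize(text) to merge_open_quotes; that way the tokenizer still does the sentence detection and the helper only repairs the splits inside quotations.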