
Splitting sentences with nltk while preserving quotes

I am using nltk to split a text into sentences. However, I need any sentence that contains a quotation to be extracted as a single unit. Right now each sentence, even if it is inside a quotation, is extracted as a separate part.

This is an example of something that I am trying to extract as a single unit:

"This is a sentence. This is also a sentence," said the cat.

Right now I have this code:

import nltk.data
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

text = '"This is a sentence. This is also a sentence," said the cat.'

print('\n-----\n'.join(tokenizer.tokenize(text, realign_boundaries=True)))

This works pretty well, but I want a sentence that contains a quotation to stay in one piece, even when the quotation itself contains multiple sentences.

The code above produces:

"This is a sentence.
-----
This is also a sentence," said the cat.

I am trying to get that whole text extracted as a single unit:

"This is a sentence. This is also a sentence," said the cat.

Is there an easy way to do this with nltk or should I use regex instead? I was impressed with how easy it was to get started with nltk, but am stuck now.

asked Nov 12 '13 at 15:11 by e h


1 Answer

If I understand the problem correctly, then this regex should do it:

import re

text = '"This is a sentence. This is also a sentence," said the cat.'

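# finditer + group(0) yields the whole match; findall would return only the
# captured groups, which are empty whenever the first alternative matches.
for match in re.finditer(r'"[^"]*\."|("[^"]*")*([^".]*\.)', text):
    print(match.group(0))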

It's a combination of two patterns joined with |. The first matches a quoted passage whose final period sits inside the closing quote. The second matches an ordinary sentence, or one or more quoted passages followed by text that ends in a period. If you have more complex sentences it may need some further adjusting.
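To make the two alternatives easier to see, here is a minimal sketch that spells out the same regex with re.VERBOSE and wraps it in a helper. The names SENTENCE_RE and split_sentences are made up for this illustration, and the capturing groups are swapped for a non-capturing one because only the whole match is used. It assumes straight double quotes that are balanced and sentences that end in a period; abbreviations such as "Dr." and sentences ending in "?" or "!" are not handled.

```python
import re

# Same two alternatives as the regex above, spelled out with re.VERBOSE.
# SENTENCE_RE and split_sentences are hypothetical names for this sketch.
SENTENCE_RE = re.compile(r'''
    "[^"]*\."                 # quoted passage whose last period is inside the quotes
    |
    (?:"[^"]*")*[^".]*\.      # optional quoted passage(s), then text up to the next period
''', re.VERBOSE)


def split_sentences(text):
    """Return each whole match, stripped of surrounding whitespace."""
    return [m.group(0).strip() for m in SENTENCE_RE.finditer(text)]


if __name__ == '__main__':
    sample = '"This is a sentence. This is also a sentence," said the cat.'
    for sentence in split_sentences(sample):
        print(sentence)
```

On the sample from the question this should yield the whole text as one unit. On mixed input such as 'He left. "Stop. Wait." She ran.' the quoted passage comes back as one item between the two plain sentences.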

answered Sep 29 '22 at 17:09 by Harold Ship