I'm using NLTK to analyze a few classic texts and I'm running into trouble tokenizing the text by sentence. For example, here's what I get for a snippet from Moby Dick:
    import nltk
    sent_tokenize = nltk.data.load('tokenizers/punkt/english.pickle')

    # (Chapter 16)
    # A clam for supper? a cold clam; is THAT what you mean, Mrs. Hussey?" says I,
    # "but that's a rather cold and clammy reception in the winter time, ain't it, Mrs. Hussey?"

    sample = 'A clam for supper? a cold clam; is THAT what you mean, Mrs. Hussey?" says I, "but that\'s a rather cold and clammy reception in the winter time, ain\'t it, Mrs. Hussey?"'

    print("\n-----\n".join(sent_tokenize.tokenize(sample)))

OUTPUT:

    A clam for supper?
    -----
    a cold clam; is THAT what you mean, Mrs.
    -----
    Hussey?
    -----
    " says I, "but that's a rather cold and clammy reception in the winter time, ain't it, Mrs.
    -----
    Hussey?
    -----
    "
I don't expect perfection here, considering that Melville's syntax is a bit dated, but NLTK ought to be able to handle terminal double quotes and titles like "Mrs." Since the tokenizer is the result of an unsupervised training algorithm, however, I can't figure out how to tinker with it.
Anyone have recommendations for a better sentence tokenizer? I'd prefer a simple heuristic that I can hack rather than having to train my own parser.
NLTK's tokenize module provides two levels of tokenization. Word tokenization: the word_tokenize() method splits a sentence into tokens or words. Sentence tokenization: the sent_tokenize() method splits a document or paragraph into sentences.
This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used.
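For reference, training such a model yourself takes only a few lines. This is a minimal sketch using NLTK's PunktTrainer; the file name 'corpus.txt' is a placeholder for whatever large plaintext collection you train on:

    from nltk.tokenize.punkt import PunktTrainer, PunktSentenceTokenizer

    # 'corpus.txt' is a placeholder: any large plaintext file in the
    # target language will do.
    with open('corpus.txt', encoding='utf-8') as f:
        raw_text = f.read()

    trainer = PunktTrainer()
    trainer.train(raw_text)  # learns abbreviations, collocations, sentence starters
    custom_tokenizer = PunktSentenceTokenizer(trainer.get_params())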
NLTK provides a function called word_tokenize() for splitting strings into tokens (nominally words). It splits tokens based on white space and punctuation. For example, commas and periods are taken as separate tokens.
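A minimal illustration of both methods (assuming the punkt model is already installed):

    from nltk.tokenize import sent_tokenize, word_tokenize

    doc = "Good muffins cost $3.88 in New York. Please buy me two of them."
    print(sent_tokenize(doc))
    # ['Good muffins cost $3.88 in New York.', 'Please buy me two of them.']
    print(word_tokenize(doc))
    # ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.',
    #  'Please', 'buy', 'me', 'two', 'of', 'them', '.']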
punkt is the package required for tokenization. You can download it with the NLTK download manager, or programmatically with nltk.download('punkt').
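That is:

    import nltk
    nltk.download('punkt')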
You need to supply a list of abbreviations to the tokenizer, like so:
    from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

    punkt_param = PunktParameters()
    punkt_param.abbrev_types = set(['dr', 'vs', 'mr', 'mrs', 'prof', 'inc'])
    sentence_splitter = PunktSentenceTokenizer(punkt_param)
    text = "is THAT what you mean, Mrs. Hussey?"
    sentences = sentence_splitter.tokenize(text)
sentences is now:
    ['is THAT what you mean, Mrs. Hussey?']
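As a quick check (my own example, reusing the sentence_splitter built above): with 'mrs' registered as an abbreviation, Punkt no longer breaks after "Mrs." but still splits on the real sentence boundary:

    text = "Is that what you mean, Mrs. Hussey? It is cold out."
    print(sentence_splitter.tokenize(text))
    # ['Is that what you mean, Mrs. Hussey?', 'It is cold out.']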
Update: This does not work if the last word of the sentence has an apostrophe or a quotation mark attached to it (like Hussey?'). So a quick-and-dirty way around this is to put spaces in front of apostrophes and quotes that follow sentence-end symbols (.!?):
    text = text.replace('?"', '? "').replace('!"', '! "').replace('."', '. "')
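The replace() chain above only handles double quotes; a small regular expression (my own generalization of the same trick) covers single quotes too:

    import re

    # Insert a space between a sentence-end symbol and a following quote
    # character so Punkt treats the punctuation as terminal.
    text = re.sub(r'([.!?])(["\'])', r'\1 \2', text)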
You can modify NLTK's pre-trained English sentence tokenizer to recognize more abbreviations by adding them to the set _params.abbrev_types. For example:
    import nltk

    extra_abbreviations = ['dr', 'vs', 'mr', 'mrs', 'prof', 'inc', 'i.e']
    sentence_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    sentence_tokenizer._params.abbrev_types.update(extra_abbreviations)
Note that the abbreviations must be specified without the final period, but do include any internal periods, as in 'i.e' above. For details about the other tokenizer parameters, refer to the relevant documentation.
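As a sanity check (my own snippet, not part of the original answer), rerunning part of the Moby Dick sample through the updated tokenizer shows the spurious break after "Mrs." is gone, although the trailing-quote issue noted earlier can still appear:

    sample = ('A clam for supper? a cold clam; is THAT what you mean, '
              'Mrs. Hussey?" says I')
    print("\n-----\n".join(sentence_tokenizer.tokenize(sample)))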