
How to tweak the NLTK sentence tokenizer

Tags: python, nlp, nltk

I'm using NLTK to analyze a few classic texts and I'm running into trouble tokenizing the text by sentence. For example, here's what I get for a snippet from Moby Dick:

    import nltk

    sent_tokenize = nltk.data.load('tokenizers/punkt/english.pickle')

    '''
    (Chapter 16)
    A clam for supper? a cold clam; is THAT what you mean, Mrs. Hussey?" says I, "but that's a rather cold and clammy reception in the winter time, ain't it, Mrs. Hussey?"
    '''
    sample = 'A clam for supper? a cold clam; is THAT what you mean, Mrs. Hussey?" says I, "but that\'s a rather cold and clammy reception in the winter time, ain\'t it, Mrs. Hussey?"'

    # Parenthesized print works on both Python 2 and Python 3.
    print("\n-----\n".join(sent_tokenize.tokenize(sample)))

    '''
    OUTPUT
    "A clam for supper?
    -----
    a cold clam; is THAT what you mean, Mrs.
    -----
    Hussey?
    -----
    " says I, "but that\'s a rather cold and clammy reception in the winter time, ain\'t it, Mrs.
    -----
    Hussey?
    -----
    "
    '''

I don't expect perfection here, considering that Melville's syntax is a bit dated, but NLTK ought to be able to handle terminal double quotes and titles like "Mrs." Since the tokenizer is the result of an unsupervised training algorithm, however, I can't figure out how to tinker with it.

Anyone have recommendations for a better sentence tokenizer? I'd prefer a simple heuristic that I can hack rather than having to train my own parser.

asked Dec 30 '12 23:12 by Chris Wilson



2 Answers

You need to supply a list of abbreviations to the tokenizer, like so:

    from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

    punkt_param = PunktParameters()
    punkt_param.abbrev_types = set(['dr', 'vs', 'mr', 'mrs', 'prof', 'inc'])
    sentence_splitter = PunktSentenceTokenizer(punkt_param)

    text = "is THAT what you mean, Mrs. Hussey?"
    sentences = sentence_splitter.tokenize(text)

sentences is now:

['is THAT what you mean, Mrs. Hussey?'] 

Update: This does not work if the last word of the sentence has an apostrophe or a quotation mark attached to it (like Hussey?'). A quick-and-dirty workaround is to insert a space before any apostrophe or quotation mark that follows a sentence-ending symbol (. ! ?). For double quotes:

    text = text.replace('?"', '? "').replace('!"', '! "').replace('."', '. "')
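
The replace chain above only covers double quotes. If single quotes need the same treatment, one regular expression can handle both. This is just a sketch using the standard library: the helper name `space_before_closing_quotes` is made up for illustration and is not part of NLTK, and the sample is the tail of the Moby Dick snippet from the question.

```python
import re

# Sentence-ending punctuation (. ! ?) immediately followed by a quote character.
_END_THEN_QUOTE = re.compile(r'([.!?])(["\'])')


def space_before_closing_quotes(text):
    """Insert a space between . ! ? and a quote mark that directly follows it.

    Hypothetical helper, meant to run on the text before it is passed
    to the sentence tokenizer.
    """
    return _END_THEN_QUOTE.sub(r'\1 \2', text)


sample = ('is THAT what you mean, Mrs. Hussey?" says I, "but that\'s a rather '
          'cold and clammy reception in the winter time, ain\'t it, Mrs. Hussey?"')
print(space_before_closing_quotes(sample))
# is THAT what you mean, Mrs. Hussey? " says I, "but that's a rather cold and
# clammy reception in the winter time, ain't it, Mrs. Hussey? "   (one line)
```

Apostrophes inside words (that's, ain't) are left alone because no sentence-ending symbol precedes them. Keep in mind that this changes the text itself, so character offsets in the tokenized output will no longer line up with the original string.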
answered Sep 20 '22 13:09 by vpekar


You can modify NLTK's pre-trained English sentence tokenizer to recognize more abbreviations by adding them to the set _params.abbrev_types. For example:

    import nltk

    extra_abbreviations = ['dr', 'vs', 'mr', 'mrs', 'prof', 'inc', 'i.e']
    sentence_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    sentence_tokenizer._params.abbrev_types.update(extra_abbreviations)

Note that the abbreviations must be specified without the final period, but do include any internal periods, as in 'i.e' above. For details about the other tokenizer parameters, refer to the relevant documentation.
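
If your abbreviations come from a hand-written list ("Mrs.", "i.e."), a small helper can put them into that form first. This is a sketch, not NLTK API: `to_abbrev_type` is a hypothetical name, and the lowercasing step assumes Punkt compares abbreviations in lowercase, which matches the all-lowercase entries in the examples here.

```python
def to_abbrev_type(abbreviation):
    """Lowercase an abbreviation and drop one trailing period.

    Internal periods are kept, so 'i.e.' becomes 'i.e'.
    Hypothetical helper for illustration only.
    """
    cleaned = abbreviation.strip().lower()
    if cleaned.endswith('.'):
        cleaned = cleaned[:-1]
    return cleaned


raw = ['Dr.', 'Mrs.', 'Prof.', 'i.e.', 'vs']
extra_abbreviations = {to_abbrev_type(a) for a in raw}
print(sorted(extra_abbreviations))
# ['dr', 'i.e', 'mrs', 'prof', 'vs']
```

The resulting set can be passed to `abbrev_types.update(...)` in the same way as the literal list above.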

answered Sep 20 '22 13:09 by bjmc