I'm using NLTK to analyze a few classic texts and I'm running into trouble tokenizing the text by sentence. For example, here's what I get for a snippet from Moby Dick:
    import nltk
    sent_tokenize = nltk.data.load('tokenizers/punkt/english.pickle')

    # (Chapter 16)
    # A clam for supper? a cold clam; is THAT what you mean, Mrs. Hussey?" says I,
    # "but that's a rather cold and clammy reception in the winter time, ain't it, Mrs. Hussey?"

    sample = 'A clam for supper? a cold clam; is THAT what you mean, Mrs. Hussey?" says I, "but that\'s a rather cold and clammy reception in the winter time, ain\'t it, Mrs. Hussey?"'

    print("\n-----\n".join(sent_tokenize.tokenize(sample)))

OUTPUT:

    A clam for supper?
    -----
    a cold clam; is THAT what you mean, Mrs.
    -----
    Hussey?
    -----
    " says I, "but that's a rather cold and clammy reception in the winter time, ain't it, Mrs.
    -----
    Hussey?
    -----
    "
I don't expect perfection here, considering that Melville's syntax is a bit dated, but NLTK ought to be able to handle terminal double quotes and titles like "Mrs." Since the tokenizer is the result of an unsupervised training algorithm, however, I can't figure out how to tinker with it.
Anyone have recommendations for a better sentence tokenizer? I'd prefer a simple heuristic that I can hack rather than having to train my own parser.
NLTK's tokenize module provides two levels of tokenization. Word tokenization: the word_tokenize() method splits a sentence into tokens or words. Sentence tokenization: the sent_tokenize() method splits a document or paragraph into sentences.
This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used.
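For reference, training such a model yourself takes only a few lines. This is a minimal sketch using NLTK's PunktTrainer; the file name 'corpus.txt' is a placeholder for whatever large plaintext collection you train on:

    from nltk.tokenize.punkt import PunktTrainer, PunktSentenceTokenizer

    # 'corpus.txt' is a placeholder: any large plaintext file in the
    # target language will do.
    with open('corpus.txt', encoding='utf-8') as f:
        raw_text = f.read()

    trainer = PunktTrainer()
    trainer.train(raw_text)  # learns abbreviations, collocations, sentence starters
    custom_tokenizer = PunktSentenceTokenizer(trainer.get_params())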
NLTK provides a function called word_tokenize() for splitting strings into tokens (nominally words). It splits tokens based on white space and punctuation. For example, commas and periods are taken as separate tokens.
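A minimal illustration of both methods (assuming the punkt model is already installed):

    from nltk.tokenize import sent_tokenize, word_tokenize

    doc = "Good muffins cost $3.88 in New York. Please buy me two of them."
    print(sent_tokenize(doc))
    # ['Good muffins cost $3.88 in New York.', 'Please buy me two of them.']
    print(word_tokenize(doc))
    # ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.',
    #  'Please', 'buy', 'me', 'two', 'of', 'them', '.']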
punkt is the package required for tokenization. You can download it with the NLTK download manager, or programmatically with nltk.download('punkt').
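That is:

    import nltk
    nltk.download('punkt')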
You need to supply a list of abbreviations to the tokenizer, like so:
    from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

    punkt_param = PunktParameters()
    punkt_param.abbrev_types = set(['dr', 'vs', 'mr', 'mrs', 'prof', 'inc'])
    sentence_splitter = PunktSentenceTokenizer(punkt_param)
    text = "is THAT what you mean, Mrs. Hussey?"
    sentences = sentence_splitter.tokenize(text)
sentences is now:
    ['is THAT what you mean, Mrs. Hussey?']
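As a quick check (my own example, reusing the sentence_splitter built above): with 'mrs' registered as an abbreviation, Punkt no longer breaks after "Mrs." but still splits on the real sentence boundary:

    text = "Is that what you mean, Mrs. Hussey? It is cold out."
    print(sentence_splitter.tokenize(text))
    # ['Is that what you mean, Mrs. Hussey?', 'It is cold out.']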
Update: This does not work if the last word of the sentence has an apostrophe or a quotation mark attached to it (like Hussey?'). So a quick-and-dirty way around this is to put spaces in front of apostrophes and quotes that follow sentence-end symbols (.!?):
    text = text.replace('?"', '? "').replace('!"', '! "').replace('."', '. "')
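The replace() chain above only handles double quotes; a small regular expression (my own generalization of the same trick) covers single quotes too:

    import re

    # Insert a space between a sentence-end symbol and a following quote
    # character so Punkt treats the punctuation as terminal.
    text = re.sub(r'([.!?])(["\'])', r'\1 \2', text)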
You can modify NLTK's pre-trained English sentence tokenizer to recognize more abbreviations by adding them to the set _params.abbrev_types. For example:
    import nltk

    extra_abbreviations = ['dr', 'vs', 'mr', 'mrs', 'prof', 'inc', 'i.e']
    sentence_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    sentence_tokenizer._params.abbrev_types.update(extra_abbreviations)
Note that the abbreviations must be specified without the final period, but do include any internal periods, as in 'i.e' above. For details about the other tokenizer parameters, refer to the relevant documentation.
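As a sanity check (my own snippet, not part of the original answer), rerunning part of the Moby Dick sample through the updated tokenizer shows the spurious break after "Mrs." is gone, although the trailing-quote issue noted earlier can still appear:

    sample = ('A clam for supper? a cold clam; is THAT what you mean, '
              'Mrs. Hussey?" says I')
    print("\n-----\n".join(sentence_tokenizer.tokenize(sample)))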