How to avoid NLTK's sentence tokenizer splitting on abbreviations?

I'm currently using NLTK for language processing, but I have run into a problem with sentence tokenization.

Here's the problem. Assume I have the sentence "Fig. 2 shows a U.S.A. map." When I use the Punkt tokenizer with my own abbreviation list, my code looks like this:

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
punkt_param = PunktParameters()
abbreviation = ['U.S.A', 'fig']
punkt_param.abbrev_types = set(abbreviation)
tokenizer = PunktSentenceTokenizer(punkt_param)
tokenizer.tokenize('Fig. 2 shows a U.S.A. map.')

It returns this:

['Fig. 2 shows a U.S.A.', 'map.']

The tokenizer fails to recognize the abbreviation "U.S.A.", but it handles "fig" correctly. Now when I use the default tokenizer that NLTK provides:

import nltk
nltk.tokenize.sent_tokenize('Fig. 2 shows a U.S.A. map.')

This time I get:

['Fig.', '2 shows a U.S.A. map.']

It recognizes the more common "U.S.A." but fails to see "fig"!

How can I combine these two methods? I want to keep the default abbreviations and also add my own.

asked Jan 15 '16 07:01

joe wong

1 Answer

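I think using lower case for "u.s.a" in the abbreviation list will work for you, since Punkt lowercases each token and drops its final period before looking it up in abbrev_types. Try this: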

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
punkt_param = PunktParameters()
abbreviation = ['u.s.a', 'fig']
punkt_param.abbrev_types = set(abbreviation)
tokenizer = PunktSentenceTokenizer(punkt_param)
tokenizer.tokenize('Fig. 2 shows a U.S.A. map.')

It returns this to me:

['Fig. 2 shows a U.S.A. map.']
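Since the fix above depends on the abbreviations being lowercase and written without the final period, it may help to normalize them in code rather than by hand. The following is only a minimal sketch: the helper names normalize_abbreviations, add_abbreviations, and build_tokenizer are made up for illustration and are not part of NLTK, and add_abbreviations assumes the tokenizer stores its parameters in the private attribute _params, which could change between NLTK versions.

```python
def normalize_abbreviations(abbreviations):
    """Return the abbreviations lowercased, with one trailing period removed."""
    normalized = set()
    for abbr in abbreviations:
        abbr = abbr.strip().lower()
        if abbr.endswith("."):
            abbr = abbr[:-1]  # "u.s.a." -> "u.s.a"
        if abbr:  # skip empty strings
            normalized.add(abbr)
    return normalized


def add_abbreviations(tokenizer, abbreviations):
    """Extend an existing Punkt tokenizer's abbreviation set in place.

    Assumption: the tokenizer exposes a PunktParameters-like object as the
    private attribute `_params`, with a set attribute `abbrev_types`.
    """
    tokenizer._params.abbrev_types.update(normalize_abbreviations(abbreviations))
    return tokenizer


def build_tokenizer(abbreviations):
    """Build a fresh Punkt tokenizer that knows only the given abbreviations."""
    # Imported here so the two helpers above can be used without NLTK installed.
    from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

    params = PunktParameters()
    params.abbrev_types = normalize_abbreviations(abbreviations)
    return PunktSentenceTokenizer(params)
```

With these helpers, build_tokenizer(['U.S.A.', 'Fig.']) ends up with the same abbreviation set as the code above, so the capitalization in the input list no longer matters. To keep the default abbreviations as well, one option is to load NLTK's pretrained English tokenizer (how that is done depends on the NLTK version) and pass it to add_abbreviations, which extends its abbreviation set instead of replacing it.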

answered Sep 21 '22 12:09

Prashant Puri

Tags: python, tokenize, nlp, nltk
