 

Preventing splitting at apostrophes when tokenizing words using nltk

Tags:

python

nltk

I am using nltk to split sentences into words, e.g.

 nltk.word_tokenize("The code didn't work!")
 -> ['The', 'code', 'did', "n't", 'work', '!']

The tokenizing works well at splitting up word boundaries (i.e. splitting punctuation from words), but it sometimes over-splits, and modifiers at the end of a word get treated as separate parts. For example, didn't gets split into the parts did and n't, and I've gets split into I and 've. Obviously this is because such words are split in two in the original corpus that nltk uses, and it may be desirable in some instances.

Is there any built-in way of overriding this behavior? Possibly in a similar manner to how nltk's MWETokenizer is able to aggregate multiple words into phrases, but in this case aggregating word components back into whole words.
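
For illustration, this is roughly the post-processing I have in mind, using MWETokenizer with an empty separator to glue the split pieces back together (just a sketch: the list of contraction parts is hand-picked here and would need to cover every contraction I care about):

    import nltk
    from nltk.tokenize import MWETokenizer

    # Hand-picked list of token pairs to re-merge; separator='' joins
    # them with no glue character, so ('did', "n't") becomes "didn't".
    # (word_tokenize requires the 'punkt' tokenizer data.)
    merger = MWETokenizer([('did', "n't"), ('I', "'ve")], separator='')
    tokens = nltk.word_tokenize("The code didn't work!")
    merger.tokenize(tokens)
    # -> ['The', 'code', "didn't", 'work', '!']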

Alternatively, is there another tokenizer that does not split up word-parts?

kyrenia asked Jan 11 '16



1 Answer

This is actually working as expected: that is the correct output for word-level tokenization. Contractions are treated as two tokens because, meaning-wise, they are two words (didn't = did + not).

Different nltk tokenizers handle English language contractions differently. For instance, I've found that TweetTokenizer does not split the contraction into two parts:

>>> from nltk.tokenize import TweetTokenizer
>>> tknzr = TweetTokenizer()
>>> tknzr.tokenize("The code didn't work!")
[u'The', u'code', u"didn't", u'work', u'!']

For more information and workarounds, see:

  • nltk tokenization and contractions
  • Expanding English language contractions in Python
  • word_tokenizer separates contractions (we'll, I'll) into different words
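
If you do not need the rest of word_tokenize's behaviour, a plain RegexpTokenizer with a pattern that allows one internal apostrophe will also keep contractions intact. This is only a sketch (the pattern below is mine, not from the posts above, and it is far cruder than word_tokenize's punctuation handling):

>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(r"\w+(?:'\w+)?|[^\w\s]")
>>> tokenizer.tokenize("The code didn't work!")
['The', 'code', "didn't", 'work', '!']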
alecxe answered Oct 13 '22