 

Preventing splitting at apostrophes when tokenizing words using nltk

Tags:

python

nltk

I am using nltk to split sentences into words, e.g.

 nltk.word_tokenize("The code didn't work!")
 -> ['The', 'code', 'did', "n't", 'work', '!']

The tokenizing works well at splitting up word boundaries (i.e. splitting punctuation from words), but it sometimes over-splits, and modifiers at the end of a word get treated as separate parts. For example, didn't gets split into the parts did and n't, and I've gets split into I and 've. Obviously this is because such words are split in two in the original corpus that nltk uses, and it may be desirable in some instances.

Is there any built-in way of overriding this behavior? Possibly in a similar manner to how nltk's MWETokenizer is able to aggregate multiple words into phrases, but in this case aggregating word components back into whole words.
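
For illustration, this is roughly the post-processing I have in mind, using MWETokenizer with an empty separator to glue the split pieces back together (just a sketch: the list of contraction parts is hand-picked here and would need to cover every contraction I care about):

    import nltk
    from nltk.tokenize import MWETokenizer

    # Hand-picked list of token pairs to re-merge; separator='' joins
    # them with no glue character, so ('did', "n't") becomes "didn't".
    # (word_tokenize requires the 'punkt' tokenizer data.)
    merger = MWETokenizer([('did', "n't"), ('I', "'ve")], separator='')
    tokens = nltk.word_tokenize("The code didn't work!")
    merger.tokenize(tokens)
    # -> ['The', 'code', "didn't", 'work', '!']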

Alternatively, is there another tokenizer that does not split up word-parts?

kyrenia asked Jan 11 '16



1 Answer

This is actually working as expected: that is the correct output for word-level tokenization. Contractions are treated as two tokens because, meaning-wise, they are two words (didn't = did + not).

Different nltk tokenizers handle English language contractions differently. For instance, I've found that TweetTokenizer does not split the contraction into two parts:

>>> from nltk.tokenize import TweetTokenizer
>>> tknzr = TweetTokenizer()
>>> tknzr.tokenize("The code didn't work!")
[u'The', u'code', u"didn't", u'work', u'!']

For more information and workarounds, see:

  • nltk tokenization and contractions
  • Expanding English language contractions in Python
  • word_tokenizer separates contractions (we'll, I'll) into different words
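
If you do not need the rest of word_tokenize's behaviour, a plain RegexpTokenizer with a pattern that allows one internal apostrophe will also keep contractions intact. This is only a sketch (the pattern below is mine, not from the posts above, and it is far cruder than word_tokenize's punctuation handling):

>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(r"\w+(?:'\w+)?|[^\w\s]")
>>> tokenizer.tokenize("The code didn't work!")
['The', 'code', "didn't", 'work', '!']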
alecxe answered Oct 13 '22