
NLTK word tokenize behaviour for double quotation marks is confusing

Tags:

python

nltk

>>> import nltk
>>> nltk.__version__
'3.0.4'
>>> nltk.word_tokenize('"')
['``']
>>> nltk.word_tokenize('""')
['``', '``']
>>> nltk.word_tokenize('"A"')
['``', 'A', "''"]

See how it changes " into `` at the start and '' at the end?

What's happening here? Why is it changing the character? Is there a fix? I need to search for each token in the string later on.

Python 2.7.6 if it makes any difference.

asked Aug 24 '15 by mota

1 Answer

TL;DR:

nltk.word_tokenize changes starting double quotes from " -> `` and ending double quotes from " -> ''.


In long:

First, nltk.word_tokenize tokenizes based on how the Penn Treebank was tokenized; it comes from nltk.tokenize.treebank, see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L91 and https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L23

class TreebankWordTokenizer(TokenizerI):
    """
    The Treebank tokenizer uses regular expressions to tokenize text as in Penn Treebank.
    This is the method that is invoked by ``word_tokenize()``.  It assumes that the
    text has already been segmented into sentences, e.g. using ``sent_tokenize()``.
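
For reference, word_tokenize() is essentially this Treebank tokenizer applied to each sentence, so you can also call the class directly on a single sentence (a minimal sketch, assuming NLTK is installed):

```python
from nltk.tokenize import TreebankWordTokenizer

# TreebankWordTokenizer is pure regex code, so it needs no downloaded models;
# it expects input that is already a single sentence.
tb = TreebankWordTokenizer()
print(tb.tokenize('"A"'))  # same quote conversion as word_tokenize
```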

Then comes a list of regex replacements for contractions at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L48; it comes from Robert MacIntyre's tokenizer, i.e. https://www.cis.upenn.edu/~treebank/tokenizer.sed

The contraction regexes split words like "gonna", "wanna", etc.:

>>> from nltk import word_tokenize
>>> word_tokenize("I wanna go home")
['I', 'wan', 'na', 'go', 'home']
>>> word_tokenize("I gonna go home")
['I', 'gon', 'na', 'go', 'home']
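
The contraction handling boils down to substitutions of this shape (a simplified sketch using `re` directly; the real list in treebank.py has more patterns):

```python
import re

# Two of the Treebank contraction patterns, paraphrased: each captures
# the two halves of the contraction so a space can be inserted between them.
CONTRACTIONS = [r"(?i)\b(gon)(na)\b", r"(?i)\b(wan)(na)\b"]

def split_contractions(text):
    # "gonna" -> "gon na", "wanna" -> "wan na"
    for pattern in CONTRACTIONS:
        text = re.sub(pattern, r"\1 \2", text)
    return text

print(split_contractions("I wanna go home"))  # I wan na go home
```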

After that we reach the punctuation part that you're asking about, see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L63:

def tokenize(self, text):
    #starting quotes
    text = re.sub(r'^\"', r'``', text)
    text = re.sub(r'(``)', r' \1 ', text)
    text = re.sub(r'([ (\[{<])"', r'\1 `` ', text)

Ah ha, the starting quote changes from " -> ``:

>>> import re
>>> text = '"A"'
>>> re.sub(r'^\"', r'``', text)
'``A"'
>>> re.sub(r'(``)', r' \1 ', re.sub(r'^\"', r'``', text))
' `` A"'
>>> re.sub(r'([ (\[{<])"', r'\1 `` ', re.sub(r'(``)', r' \1 ', re.sub(r'^\"', r'``', text)))
' `` A"'
>>> text_after_startquote_changes = re.sub(r'([ (\[{<])"', r'\1 `` ', re.sub(r'(``)', r' \1 ', re.sub(r'^\"', r'``', text)))
>>> text_after_startquote_changes
' `` A"'

Then we see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L85 that deals with ending quotes:

    #ending quotes
    text = re.sub(r'"', " '' ", text)
    text = re.sub(r'(\S)(\'\')', r'\1 \2 ', text)

Applying the regexes:

>>> re.sub(r'"', " '' ", text_after_startquote_changes)
" `` A '' "
>>> re.sub(r'(\S)(\'\')', r'\1 \2 ', re.sub(r'"', " '' ", text_after_startquote_changes))
" `` A '' "

So if you want to search the list of tokens for double quotes after nltk.word_tokenize, simply search for `` and '' instead of ".
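
Alternatively, if you'd rather keep searching for the literal " character, one option (a post-processing sketch, not part of NLTK) is to normalize the quote tokens back after tokenizing:

```python
# Map the Treebank quote tokens back to a plain double quote. This loses
# the open/close distinction, which is usually fine for searching.
QUOTE_MAP = {'``': '"', "''": '"'}

def normalize_quotes(tokens):
    return [QUOTE_MAP.get(tok, tok) for tok in tokens]

print(normalize_quotes(['``', 'A', "''"]))  # ['"', 'A', '"']
```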

answered Oct 23 '22 by alvas