>>> import nltk
>>> nltk.__version__
'3.0.4'
>>> nltk.word_tokenize('"')
['``']
>>> nltk.word_tokenize('""')
['``', '``']
>>> nltk.word_tokenize('"A"')
['``', 'A', "''"]
See how it changes " to a double `` and ''?
What's happening here? Why is it changing the character, and is there a fix? I need to search for each token in the string later on.
I'm on Python 2.7.6, if it makes any difference.
Nothing is actually lost from the text. The workflow assumed by NLTK is that you first tokenize into sentences and then tokenize every sentence into words. That is why word_tokenize() does not work well with multiple sentences. To get rid of the punctuation tokens, you can use a regular expression or Python's isalnum() method.
NLTK contains a module called tokenize, which provides two main kinds of tokenizers. Word tokenization: the word_tokenize() method splits a sentence into tokens or words. Sentence tokenization: the sent_tokenize() method splits a document or paragraph into sentences.
NLTK provides a function called word_tokenize() for splitting strings into tokens (nominally words). It splits tokens based on white space and punctuation. For example, commas and periods are taken as separate tokens.
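For example, here is a minimal sketch combining both steps: split into sentences first, then into words, then drop pure-punctuation tokens with isalnum() (this assumes the punkt sentence model is installed; exact tokens can vary slightly between NLTK versions):
>>> from nltk import sent_tokenize, word_tokenize
>>> text = 'He said "hello". Then he left.'
>>> [tok for sent in sent_tokenize(text) for tok in word_tokenize(sent) if tok.isalnum()]
['He', 'said', 'hello', 'Then', 'he', 'left']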
TL;DR:
nltk.word_tokenize changes starting double quotes from " -> `` and ending double quotes from " -> ''.
In long:
First, nltk.word_tokenize tokenizes based on how the Penn Treebank was tokenized; it comes from nltk.tokenize.treebank, see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L91 and https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L23:
class TreebankWordTokenizer(TokenizerI):
"""
The Treebank tokenizer uses regular expressions to tokenize text as in Penn Treebank.
This is the method that is invoked by ``word_tokenize()``. It assumes that the
text has already been segmented into sentences, e.g. using ``sent_tokenize()``.
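So word_tokenize() essentially delegates to TreebankWordTokenizer on already sentence-split text, and you can reproduce the behaviour by calling the tokenizer class directly:
>>> from nltk.tokenize.treebank import TreebankWordTokenizer
>>> TreebankWordTokenizer().tokenize('"A"')
['``', 'A', "''"]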
Then comes a list of regex replacements for contractions at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L48; they come from Robert MacIntyre's tokenizer, i.e. https://www.cis.upenn.edu/~treebank/tokenizer.sed
The contraction rules split words like "gonna", "wanna", etc.:
>>> from nltk import word_tokenize
>>> word_tokenize("I wanna go home")
['I', 'wan', 'na', 'go', 'home']
>>> word_tokenize("I gonna go home")
['I', 'gon', 'na', 'go', 'home']
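Each contraction rule is essentially a regex substitution that inserts a space inside the word. A rough illustration of the idea (these patterns are my own paraphrase, not the exact list NLTK compiles):
>>> import re
>>> contractions = [r"(?i)\b(wan)(na)\b", r"(?i)\b(gon)(na)\b", r"(?i)\b(got)(ta)\b"]
>>> text = "I wanna go home"
>>> for pattern in contractions:
...     text = re.sub(pattern, r"\1 \2", text)
...
>>> text
'I wan na go home'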
After that we reach the punctuation part that you're asking about, see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L63:
def tokenize(self, text):
    # starting quotes
    text = re.sub(r'^\"', r'``', text)
    text = re.sub(r'(``)', r' \1 ', text)
    text = re.sub(r'([ (\[{<])"', r'\1 `` ', text)
Ah ha, the starting quote changes from " -> ``:
>>> import re
>>> text = '"A"'
>>> re.sub(r'^\"', r'``', text)
'``A"'
>>> re.sub(r'(``)', r' \1 ', re.sub(r'^\"', r'``', text))
' `` A"'
>>> re.sub(r'([ (\[{<])"', r'\1 `` ', re.sub(r'(``)', r' \1 ', re.sub(r'^\"', r'``', text)))
' `` A"'
>>> text_after_startquote_changes = re.sub(r'([ (\[{<])"', r'\1 `` ', re.sub(r'(``)', r' \1 ', re.sub(r'^\"', r'``', text)))
>>> text_after_startquote_changes
' `` A"'
Then we see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L85 that deals with ending quotes:
    # ending quotes
    text = re.sub(r'"', " '' ", text)
    text = re.sub(r'(\S)(\'\')', r'\1 \2 ', text)
Applying the regexes:
>>> re.sub(r'"', " '' ", text_after_startquote_changes)
" `` A '' "
>>> re.sub(r'(\S)(\'\')', r'\1 \2 ', re.sub(r'"', " '' ", text_after_startquote_changes))
" `` A '' "
So if you want to search the list of tokens for double quotes after nltk.word_tokenize, simply search for `` and '' instead of ".
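Alternatively, if you prefer to work with literal quotes, you can map the Penn Treebank quote tokens back to " after tokenizing. This untreebank_quotes helper is just an illustration, not something NLTK provides:
>>> def untreebank_quotes(tokens):
...     # map Penn Treebank quote tokens back to a plain double quote
...     return ['"' if tok in ('``', "''") else tok for tok in tokens]
...
>>> untreebank_quotes(word_tokenize('"A"'))
['"', 'A', '"']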