>>> import nltk
>>> nltk.__version__
'3.0.4'
>>> nltk.word_tokenize('"')
['``']
>>> nltk.word_tokenize('""')
['``', '``']
>>> nltk.word_tokenize('"A"')
['``', 'A', "''"]
See how it changes " to a double `` and ''?
What's happening here? Why is it changing the character, and is there a fix? I need to search for each token in the string later on.
I'm on Python 2.7.6, if it makes any difference.
Nothing is actually lost from the text. The workflow assumed by NLTK is that you first tokenize into sentences and then tokenize every sentence into words. That is why word_tokenize() does not work well with multiple sentences. To get rid of the punctuation tokens, you can use a regular expression or Python's isalnum() method.
NLTK contains a module called tokenize, which provides two main kinds of tokenizers. Word tokenization: the word_tokenize() method splits a sentence into tokens or words. Sentence tokenization: the sent_tokenize() method splits a document or paragraph into sentences.
NLTK provides a function called word_tokenize() for splitting strings into tokens (nominally words). It splits tokens based on white space and punctuation. For example, commas and periods are taken as separate tokens.
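For example, here is a minimal sketch combining both steps: split into sentences first, then into words, then drop pure-punctuation tokens with isalnum() (this assumes the punkt sentence model is installed; exact tokens can vary slightly between NLTK versions):
>>> from nltk import sent_tokenize, word_tokenize
>>> text = 'He said "hello". Then he left.'
>>> [tok for sent in sent_tokenize(text) for tok in word_tokenize(sent) if tok.isalnum()]
['He', 'said', 'hello', 'Then', 'he', 'left']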
TL;DR:
nltk.word_tokenize changes starting double quotes from " -> `` and ending double quotes from " -> ''.
In long:
First, nltk.word_tokenize tokenizes based on how the Penn Treebank was tokenized; it comes from nltk.tokenize.treebank, see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L91 and https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L23:
class TreebankWordTokenizer(TokenizerI):
"""
The Treebank tokenizer uses regular expressions to tokenize text as in Penn Treebank.
This is the method that is invoked by ``word_tokenize()``. It assumes that the
text has already been segmented into sentences, e.g. using ``sent_tokenize()``.
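So word_tokenize() essentially delegates to TreebankWordTokenizer on already sentence-split text, and you can reproduce the behaviour by calling the tokenizer class directly:
>>> from nltk.tokenize.treebank import TreebankWordTokenizer
>>> TreebankWordTokenizer().tokenize('"A"')
['``', 'A', "''"]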
Then comes a list of regex replacements for contractions at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L48; they come from Robert MacIntyre's tokenizer, i.e. https://www.cis.upenn.edu/~treebank/tokenizer.sed
The contraction rules split words like "gonna", "wanna", etc.:
>>> from nltk import word_tokenize
>>> word_tokenize("I wanna go home")
['I', 'wan', 'na', 'go', 'home']
>>> word_tokenize("I gonna go home")
['I', 'gon', 'na', 'go', 'home']
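Each contraction rule is essentially a regex substitution that inserts a space inside the word. A rough illustration of the idea (these patterns are my own paraphrase, not the exact list NLTK compiles):
>>> import re
>>> contractions = [r"(?i)\b(wan)(na)\b", r"(?i)\b(gon)(na)\b", r"(?i)\b(got)(ta)\b"]
>>> text = "I wanna go home"
>>> for pattern in contractions:
...     text = re.sub(pattern, r"\1 \2", text)
...
>>> text
'I wan na go home'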
After that we reach the punctuation part that you're asking about, see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L63:
def tokenize(self, text):
    # starting quotes
    text = re.sub(r'^\"', r'``', text)
    text = re.sub(r'(``)', r' \1 ', text)
    text = re.sub(r'([ (\[{<])"', r'\1 `` ', text)
Ah ha, the starting quote changes from " -> ``:
>>> import re
>>> text = '"A"'
>>> re.sub(r'^\"', r'``', text)
'``A"'
>>> re.sub(r'(``)', r' \1 ', re.sub(r'^\"', r'``', text))
' `` A"'
>>> re.sub(r'([ (\[{<])"', r'\1 `` ', re.sub(r'(``)', r' \1 ', re.sub(r'^\"', r'``', text)))
' `` A"'
>>> text_after_startquote_changes = re.sub(r'([ (\[{<])"', r'\1 `` ', re.sub(r'(``)', r' \1 ', re.sub(r'^\"', r'``', text)))
>>> text_after_startquote_changes
' `` A"'
Then we see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L85 that deals with ending quotes:
    # ending quotes
    text = re.sub(r'"', " '' ", text)
    text = re.sub(r'(\S)(\'\')', r'\1 \2 ', text)
Applying the regexes:
>>> re.sub(r'"', " '' ", text_after_startquote_changes)
" `` A '' "
>>> re.sub(r'(\S)(\'\')', r'\1 \2 ', re.sub(r'"', " '' ", text_after_startquote_changes))
" `` A '' "
So if you want to search the list of tokens for double quotes after nltk.word_tokenize, simply search for `` and '' instead of ".
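Alternatively, if you prefer to work with literal quotes, you can map the Penn Treebank quote tokens back to " after tokenizing. This untreebank_quotes helper is just an illustration, not something NLTK provides:
>>> def untreebank_quotes(tokens):
...     # map Penn Treebank quote tokens back to a plain double quote
...     return ['"' if tok in ('``', "''") else tok for tok in tokens]
...
>>> untreebank_quotes(word_tokenize('"A"'))
['"', 'A', '"']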